DOCUMENT RESUME

ED 131 119          95          TM 005 845

AUTHOR          Cramer, Elliot M.; Appelbaum, Mark I.
TITLE           An Evaluation of Some Methods Used in the National
                Assessment of Educational Progress. Final Report.
INSTITUTION     North Carolina Univ., Chapel Hill. L.L. Thurstone
                Psychometric Lab.
SPONS AGENCY    National Inst. of Education (DHEW), Washington, D.C.
PUB DATE        76
GRANT           NEG-00-3-0111
NOTE            131p.

EDRS PRICE      MF-$0.83 HC-$7.35 Plus Postage.
DESCRIPTORS     *Academic Achievement; Analysis of Covariance;
                *Analysis of Variance; Comparative Analysis; *Groups;
                *National Surveys; *Statistical Analysis
IDENTIFIERS     *Balancing; National Assessment of Educational
                Progress; Nonorthogonal Analysis of Variance

ABSTRACT

A recurring problem in educational research has been
the adjustment of data to account for initial differences among
observed groups of individuals on attributes uncontrollable by the
researcher. The procedure called "balancing" is introduced in the
National Assessment of Educational Progress report as an adjustment
method for this purpose. Since it is apparent that balancing is being
used extensively both in the NAEP work and in the analysis of data
from state assessments, this research aims at the development of a
better understanding of the method and an evaluation of its strengths
and weaknesses. The investigation of the nature of balancing required
a detailed investigation of the nonorthogonal analysis of variance,
the fundamental concepts of marginal means and marginal populations,
as well as the investigation of balancing-like data analytic
techniques such as "smear and sweep," analysis of covariance, and
standardization. It was concluded that the general framework of
nonorthogonal analysis of variance encompasses the most useful of the
adjustment procedures when used in conjunction with the estimation of
weighted marginal means. (RC)

SUMMARY

The National Assessment of Educational Progress (NAEP) has had as its
purpose the measurement of educational achievement in children and young adults.
NAEP Report 7 (1971) is of particular interest and importance in that it
characterizes the performance of blacks, of respondents with differing levels
of parental education, and of respondents from differing types of community.
The authors note that the report describes differences as they are and as
they would be in particular subgroups if the effects of other characteristics
were represented proportionately in each subgroup. Since in a direct
comparison between groups the effect of one characteristic can masquerade as
the effect of another, the method selected for comparing groups is of great
importance. For example, on science exercises in the report there is a 20%
difference between the extreme affluent suburbs and the extreme inner city.
Because of the difference in parental education of the two groups, part of
this 20% difference may be "considered to grow out of the difference in parental
education." One would wish to compare the two groups as if they were
comparable with respect to parental education.

The procedure called "balancing" is introduced in the NAEP report as an
adjustment method for this purpose. Little seems to be known about the
properties of the method beyond the brief description given in the report. Since
it is apparent that "balancing" is being used extensively both in the NAEP
work and in the analysis of data from state assessments such as the State
Assessment of Educational Progress in North Carolina, the development of a
better understanding of the method and an evaluation of its strengths and
weaknesses is vital. This has been the principal aim of the research described
in this report.

The investigation of the nature of balancing has required a detailed
investigation of the nonorthogonal analysis of variance, the fundamental
concepts of marginal means and marginal populations, as well as the
investigation of balancing-like data analytic techniques such as "smear and
sweep," analysis of covariance, and standardization. It has been concluded that
the general framework of nonorthogonal analysis of variance encompasses
the most useful of the adjustment procedures when used in conjunction with
the estimation of weighted marginal means.

The material in this report was prepared by

Mark I. Appelbaum
Elliot M. Cramer
Lyle V. Jones
Scott Maxwell
Samuel Peng

Chapter I: Introduction

In surveys one typically describes the ways in which particular groups
of individuals differ. One would frequently like to know why the groups differ
and whether the differences might be ascribed to other variables which might
be modified by educational intervention. The National Assessment of Educational
Progress (NAEP), for example, has had as its purpose the measurement of educational
achievement in children and young adults. Of particular interest has been the
performance of blacks, of respondents with differing levels of parental
education, and of respondents from differing types of community. NAEP Report 7
(1971) describes differences as they are and as they would be in particular
subgroups if the effects of other characteristics were represented
proportionately in each subgroup. The method of comparison is of great
importance, since in a direct comparison of groups the differences attributed
to one characteristic may actually be due to another characteristic. The
procedure called "balancing" is introduced as an adjustment method for this
purpose, apparently for the first time. It is described by the authors as
follows:

"The unadjusted results as reported here and in Report 4 clearly and
accurately estimate the differences in achievement between specific groups of
children. For example, over all the science exercises, the median percentage
difference between 13-year-olds in the Extreme Affluent Suburbs and in the
Extreme Inner City is 20% (from Exhibit 6-1). Except for sampling error, this
accurately reflects how these two groups differ.

"However, children in the Extreme Affluent Suburb tend, more than children
in the Extreme Inner City, to have better educated parents. Because of this
lack of balance, part of the difference between these two groups may be
considered as growing out of the difference in parental education. Part, also,
may be attributable to other factors on which the two groups differ. Some of
these factors have been determined for our respondents -- their sex, color and
the region of residence. Many other possibly relevant factors have not been
determined, such as the economic level of the children's parents and the cultural
environment in the home.

"It is natural to ask, 'What would the difference between these extreme
types of community have been if the distribution of Parental Education, sex,
color and region had been the same for both types of community referred to
above?' Were it possible to rearrange the world to equate these distributions
for each type of community, the effects upon our nation and its schools would
be profound. Such rearrangement is not possible. It is usually appropriate to
think of the balanced results presented in this report as reflecting the
differences we would see in the absence of masquerading by the other four factors.
We can be reasonably sure the balanced results do a much better job than the
unadjusted results of reflecting such differences."

Apparently the only justification currently available for the use of the
method is contained in a ten page appendix of illustrative examples. The basic
data treated in the examples are two-way tables of frequency counts giving the
number of individuals in a particular cell who have successfully performed on a
particular science exercise. This is illustrated in Example 1 where a random
sample of 600 individuals is drawn from some well-defined population. The number
of cases in each cell is representative of the relative number in the population
for the particular combination of conditions specified, and the degree of
success for that group is estimated by the proportion of respondents giving correct
answers to an exercise.

From the two tables, one for numbers of observations and one for numbers
of successes, the marginal values are row and column totals.

Example 1

        Number of observations             Number of successes

                    B                                B
               1      2    Total                1      2
        1    100    100     200           1    50     30
    A   2     50    150     200       A   2    30     60
        3      0    200     200           3     0    100
    Total    150    450     600       Total    80    190

The problem of concern to the authors of the NAEP Report is that the
marginal proportions may not be representative of the underlying effects.
There is one sense in which these values are representative: the data are from
a well defined population, and the marginal proportions are estimates of the
proportion of success for that population. However, if one wishes to get at an
assumed underlying effect of extreme inner city uncontaminated, say, by the
effect of parental education, these marginal values are not representative.
Their introduction of balancing was an attempt to obtain representative values.

The NAEP Report notes that interactive differences are not considered and
balancing does not adjust for them; and also that, "The deficiencies of balancing
are clear; it cannot be the final answer." Balancing will frequently involve
estimation in a linear model that is known to be wrong, e.g., when there are
interactions present. Also there are other choices of weights, and although
other choices do not affect differences between effects, they do affect the
absolute magnitudes. We need then to develop a deeper understanding of
nonorthogonal ANOVA which will carry over to the interpretation of balanced
estimates, as well as providing insight into data analysis more generally. It
should be noted that although the National Assessment uses medians rather than
means for estimation and uses special methods for estimating standard errors,
the formulation presented here may perfectly well be used for estimating
adjusted effects.

We will show that the method of "balancing" can be developed in conceptually
quite a different way which makes clear that it is a special case of
nonorthogonal analysis of variance. The problem of interpreting balanced estimates
then can be related to the more general problem of interpreting adjusted effects
in the nonorthogonal analysis of variance. This more general problem has been
of concern to us since there is not currently a consensus of opinion on the
proper methods of analysis for this more general situation. This is reflected
by the divergent suggestions we have received from mathematical statisticians
regarding the testing of main effects by eliminating both interactions and
other main effects, as opposed to eliminating only other main effects. Of
course such problems of interpretation arise in regression analysis, too. We
have been concerned with this area as well. In a recent article (Cramer, 1972),
misuses of regression analysis were discussed, even some that had been published
in The American Statistician.

Chapter II: Balancing and the Analysis of Variance

The primary aim of this grant, the explication of balancing in terms of
the analysis of variance, is presented in this chapter. It is shown that,
without question, balancing is intimately related to classical nonorthogonal
ANOVA. The implications for the interpretation of balanced results are
presented.

The educational researcher engaged in large scale multifactor
survey research may often be faced with a substantial statistical
problem whenever the number of observational units is not
equal in each and every cell of an experimental or survey design.
This situation may arise for a number of reasons ranging from the
state of nature to the socio-politics of educational research.
For whatever reason the nonorthogonality occurs, the statistical
problem remains the same, namely that of being able to estimate
the effects of the several states of nature uncontaminated by one
another. Simple methods of computing marginal means, marginal
percentages, etc. will not yield the desired results.

In an attempt to provide an appropriate method for assessing
such effects, Tukey and his associates in the NAEP (1971) studies
have offered a method called balancing or the balanced fit. While
this method does indeed provide the appropriate estimates of
effects under a somewhat restrictive set of assumptions, it is
presented in a manner which tends to obscure the meaning of these
estimates in relation to well known statistical methods. Indeed,
we shall show that the estimation procedure in balancing is nothing
more nor less than the estimation procedure in the ordinary Least Squares
estimation of a nonorthogonal main effects model analysis of variance.

An Example and an Incorrect (but Traditional) Analysis

Let us assume, for illustration, that the following survey
had been undertaken [3]: first grade classrooms from three geographical
areas of the country have been randomly selected, 200 from each
area, and the method of teaching reading noted for each class,
either "phonic method" or "sight method." Each student in each
class is given a standardized reading test and the total classroom
experience is rated a success if one half or more of the
students in the class score at or above their individual age norm
on that test. The researcher is interested in assessing the
effects of method of instruction and region of residency upon
reading skills.

[3] The data for this example were adapted from NAEP Report 7 (1971).

Table 1 shows the number of classrooms ($n_{ij}$) observed in
each cell of the survey. It is apparent from Table 1 that there
are three times as many sight reading classes as phonic reading
classes and that there are no phonic classes in Region III.
Furthermore, the design is unbalanced (nonorthogonal) since the cell
frequencies are unequal and there is no constant of proportionality
between the numbers in either rows or columns of the design.

Table 1

Number of Observations in Cells ($n_{ij}$)

                  Phonic    Sight    Total
    Region I        100      100      200
    Region II        50      150      200
    Region III        0      200      200
    Total           150      450      600

Let us assume that the world operates in such a way that the
proportions of successes (classrooms in which 50% or more of the
students operate at or above their age norm) are as given in
Table 2. Within any of the three regions the phonic method
is 20 percentage points higher than the sight method, and within
either reading method Region I is 10 percentage points higher
than Region II, which is in turn 10 percentage points higher than
Region III.

Table 2

True Proportion of Successes in Cells

                  Phonic    Sight
    Region I        .70      .50
    Region II       .60      .40
    Region III      .50      .30

Now let us suppose that our researcher is "in luck" and the
true proportions given in Table 2 exactly reveal themselves in
his data, i.e., the numbers of successes $k_{ij}$ yielding observed
proportions $p_{ij}$ in each cell are as presented in Table 3. We can
see that in each cell, save Region III phonic classes which are
nonexistent, the proportion of successes is identical to that
given in Table 2.

Table 3

Observed Number of Successes, $k_{ij}$, with Observed Proportion, $p_{ij}$,
of Successes in Parentheses

                  Phonic       Sight
    Region I     70 (.70)     50 (.50)
    Region II    30 (.60)     60 (.40)
    Region III    0 (-)       60 (.30)

The researcher is, however, interested in the
differential effects of region and method of instruction, and so
calculates the number of successes in each row and each column,
finding the marginal number of successes given in Table 4. Upon
converting to percentages (dividing the marginal number of successes
by the marginal totals), he finds the percentages given in the
column labeled "Marginal % Successes." He further notes that
overall 45% of the classrooms meet the success criterion. Since
he is interested in the differential effects of region and
instructional method, he then subtracts the total percent success
(45%) from each of the marginal percentages, yielding the
results presented in the column labeled "Differential %."
These results indicate that the difference between Region I and
Region II is 15% and that between II and III is 15%, while the
difference between the instructional methods is about 29%.
We know, however, from Table 2 that the effects are actually 10%
for each region and 20% for reading method. There appears to
be a contradiction.
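The naive calculation just described can be retraced in a few lines. The sketch below is illustrative only: it recomputes the marginal and differential percentages from the counts in Tables 1 and 3, and the function and variable names are assumptions introduced here, not part of the original report.

```python
# Cell counts: rows are Regions I-III, columns are (Phonic, Sight).
n = [[100, 100], [50, 150], [0, 200]]   # observations, Table 1
k = [[70, 50], [30, 60], [0, 60]]       # successes, Table 3


def naive_margins(n, k):
    """Return (row proportions, column proportions, overall proportion)
    obtained by dividing marginal successes by marginal totals."""
    rows = [sum(kr) / sum(nr) for kr, nr in zip(k, n)]
    cols = [sum(kc) / sum(nc) for kc, nc in zip(zip(*k), zip(*n))]
    overall = sum(map(sum, k)) / sum(map(sum, n))
    return rows, cols, overall


row_pct, col_pct, overall = naive_margins(n, k)

# "Differential %": marginal proportion minus the overall proportion.
row_diff = [p - overall for p in row_pct]
col_diff = [p - overall for p in col_pct]

print("regions:", [round(100 * p, 1) for p in row_pct])
print("methods:", [round(100 * p, 1) for p in col_pct])
print("overall:", round(100 * overall, 1))
```

The region margins differ by 15 points and the method margins by roughly 29 points, even though every cell was generated with 10-point region steps and a 20-point method gap.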

These results illustrate the problem encountered when the
cell frequencies in a design are unequal and disproportional.
Clearly we desire an analytic method which accurately reflects
the differential effects of the classificatory variables and
which can reproduce accurately the observed data from the estimated
effects. The naive method illustrated above does neither.
The problem is, essentially, that in the disproportional cell
frequency case, one effect can masquerade as another. The
example given is particularly complex because the estimates of
the region effect are confounded with those of instructional
method, while simultaneously the instructional method effect is
confounded with region effects (i.e., neither set of estimates is
free of the influence of the other).



The Balanced Fit and Estimated Effects 

The method proposed in NAEP Report 7 (1971) for estimating
the effect of one classification variable uncontaminated by the
influence of the other in a two way cross-classification has
been designated the "balanced fit" by its authors. We find the
fundamental principle of the balanced fit stated in the NAEP
report as follows:

    We intend to find group effects (expressed in percentages)
    that, when combined by addition with each other and with the
    overall percentage of success, give fitted percentages of
    success that correspond with the actual data in one simple way:

    -- if we choose any group by a single characteristic, say
    group A, and if we use the fitted percentages and the actual
    number of cases to calculate the number of successes for each
    subgroup that involves (group A), and if we then add these
    calculated numbers of successes, the total number of successes
    over all subgroups will be the same as the total actually
    observed in the data.

Let us take estimated group effects to mean differential
row effects, say $\hat p_{i.} - \hat p_{..}$, and differential column
effects, say $\hat p_{.j} - \hat p_{..}$, where $\hat p_{..}$ is the
estimated overall proportion of successes. We may then write an
expression for the estimated proportion of successes in each cell as

$$\hat p_{ij} = \hat p_{..} + (\hat p_{i.} - \hat p_{..}) + (\hat p_{.j} - \hat p_{..}). \tag{1}$$

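Since the observed number of successes for each cell is $n_{ij} p_{ij}$
while the predicted number of successes is $n_{ij} \hat p_{ij}$, the basic
principle of balancing, that the sum of the observed numbers
of successes equal the sum of the predicted numbers of successes,
then gives the row conditions

$$\sum_j n_{ij}\, p_{ij} = \sum_j n_{ij}\, \hat p_{ij}, \qquad i = 1, 2, \ldots, I$$

and similarly the column conditions

$$\sum_i n_{ij}\, p_{ij} = \sum_i n_{ij}\, \hat p_{ij}, \qquad j = 1, 2, \ldots, J$$

or equivalently

$$\sum_j n_{ij}\,(p_{ij} - \hat p_{ij}) = 0, \quad i = 1, 2, \ldots, I; \qquad \sum_i n_{ij}\,(p_{ij} - \hat p_{ij}) = 0, \quad j = 1, 2, \ldots, J. \tag{2}$$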
Since there are, in fact, infinitely many solutions to this system
of simultaneous equations, two additional conditions are
introduced in balancing which make the solution unique:

$$\sum_i n_{i.}\,(\hat p_{i.} - \hat p_{..}) = 0, \qquad \sum_j n_{.j}\,(\hat p_{.j} - \hat p_{..}) = 0, \tag{3}$$

where $n_{i.}$ and $n_{.j}$ are the numbers of observations in the ith row
and jth column, respectively. Applying the constraints in (3) to
(1), we see that

$$\hat p_{..} = \frac{\sum_i \sum_j n_{ij}\, p_{ij}}{n_{..}}$$

and that the $\hat p_{i.}$ and $\hat p_{.j}$ are marginal proportions using the
weights $n_{.j}$ for rows and $n_{i.}$ for columns. As the NAEP Report notes,
these conditions, (2) and (3), are sufficient to uniquely define the group
effects in (1) and hence to uniquely define the estimated marginal
proportions $\hat p_{i.}$ and $\hat p_{.j}$. We shall now show that (1), (2), and
(3) have exact parallels in the nonorthogonal analysis of variance.
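To make the balancing conditions concrete, the following sketch solves them for the classroom example by alternately enforcing the row conditions and the column conditions until both hold. This alternating scheme is an assumption chosen for brevity; it is not claimed to be the algorithm used in the NAEP report, and the name `balanced_fit` is hypothetical.

```python
def balanced_fit(n, k, sweeps=200):
    """Solve the balancing conditions for a two-way table.

    n[i][j] = number of observations, k[i][j] = number of successes.
    Returns (overall, row_effects, col_effects), with the effects
    satisfying the weighted zero-sum conditions on rows and columns.
    """
    I, J = len(n), len(n[0])
    n_row = [sum(n[i]) for i in range(I)]
    n_col = [sum(n[i][j] for i in range(I)) for j in range(J)]
    n_tot = sum(n_row)
    overall = sum(map(sum, k)) / n_tot
    a = [0.0] * I
    b = [0.0] * J
    for _ in range(sweeps):
        # Row conditions: fitted successes in row i equal observed successes.
        a = [sum(k[i][j] - n[i][j] * (overall + b[j]) for j in range(J)) / n_row[i]
             for i in range(I)]
        # Column conditions: the same requirement for each column j.
        b = [sum(k[i][j] - n[i][j] * (overall + a[i]) for i in range(I)) / n_col[j]
             for j in range(J)]
    # Move a constant between the two sets of effects so that both
    # weighted sums are zero; the fitted cell values are unchanged.
    shift = sum(w * x for w, x in zip(n_row, a)) / n_tot
    a = [x - shift for x in a]
    b = [x + shift for x in b]
    return overall, a, b


n = [[100, 100], [50, 150], [0, 200]]   # Table 1
k = [[70, 50], [30, 60], [0, 60]]       # Table 3
mu, row_eff, col_eff = balanced_fit(n, k)
print(round(mu, 4), [round(x, 4) for x in row_eff], [round(x, 4) for x in col_eff])
```

For these data the balanced effects recover the 10-point steps between regions and the 20-point gap between methods that the naive margins obscured.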

Estimation in the Nonorthogonal Analysis of Variance

Readers familiar with the analysis of variance (ANOVA) will
recognize certain similarities between the survey design of the
example and designs often analyzed by ANOVA techniques. We shall now
show that the estimates of differential effects obtained from the
balancing procedure are exactly the same as those produced by a
nonorthogonal analysis of variance of a main effects model.
It should be emphasized that we are dealing, at this time, with
estimation in the ANOVA model, not the tests of significance
which are more commonly seen in ANOVA applications.

This important relationship between balancing and ANOVA will
be more easily seen if we adapt our notation and terminology to
that commonly employed in the ANOVA context. In this case we are
dealing with a two-way cross-classification, often called a two-way
factorial, with unequal and disproportional cell frequencies
(a nonorthogonal factorial design). We now consider our first
classification (factor), labeled A, to have I levels and the second
classification (factor), labeled B, to have J levels. We
will use the symbol $y_{ijk}$ to represent the score of the kth
classroom under the ith level of A and the jth level of B.
Let $y_{ijk}$ equal one for a success (if 50% or more of the students
in the class score at or above their age norm) or zero for a
failure; $y_{ijk}$ is then a binary random variable. The cell mean

$$\bar y_{ij} = \sum_k y_{ijk} / n_{ij}$$

is simply the observed proportion of successes in the i,jth cell
and will correspond to $p_{ij}$ in our earlier notation.

In the estimation phase of the analysis of variance employing
a main effects model, one attempts to predict the i,jth cell
mean through the linear model

$$\bar y_{ij} = \mu + \alpha_i + \beta_j + e_{ij},$$

where the $\alpha_i$ may be thought of as the estimated differential
effect of the ith level of A, $\beta_j$ as the estimated differential
effect of the jth level of B, and $\mu$ as a general or average
effect about which the differential effects operate. In the
analysis of variance we estimate the values of these parameters
according to the Method of Least Squares, i.e., so that the sum
of squared deviations of the observed scores from the predicted
scores is a minimum. If we let $\hat p_{ij}$ indicate that value of the
cell mean predicted from the Least Squares estimates of the
parameters for the i,jth cell, writing

$$\hat p_{ij} = \hat\mu + \hat\alpha_i + \hat\beta_j, \tag{4}$$

we may obtain the least squares estimates of the unknown parameters
by minimizing

$$S = \sum_i \sum_j n_{ij}\,(p_{ij} - \hat\mu - \hat\alpha_i - \hat\beta_j)^2 .$$

Since we are predicting cell means, we weight each cell by
the number of observations in that cell. To minimize S we may
differentiate with respect to $\mu$, $\alpha_i$, and $\beta_j$ to obtain

$$\sum_i \sum_j n_{ij}\,(p_{ij} - \hat p_{ij}) = 0, \qquad \sum_j n_{ij}\,(p_{ij} - \hat p_{ij}) = 0 \;\; (i = 1, \ldots, I), \qquad \sum_i n_{ij}\,(p_{ij} - \hat p_{ij}) = 0 \;\; (j = 1, \ldots, J). \tag{5}$$

The equations in (5), usually referred to as "normal equations"
in the ANOVA context, do not themselves uniquely define $\mu$, $\alpha_i$, and
$\beta_j$. They do, however, yield unique values of $\hat p_{ij}$; that is, any
set of values which is a solution of (5) will yield the same
values of the $\hat p_{ij}$. Substituting into (4), it follows that

$$\hat\alpha_i - \hat\alpha_k = \hat p_{ij} - \hat p_{kj}$$

are uniquely defined regardless of which particular set of
$\hat\alpha$'s is used. This result is easily generalizable
to the fact that contrasts in the unknown parameters are unique
for any solution of (5).

In order to obtain computationally unique solutions for
these parameters it is the usual practice in the analysis of
variance to further constrain the system by the condition that

$$\sum_i \alpha_i = \sum_j \beta_j = 0 .$$

While this is the most commonly employed set of constraints,
any other set of constraints will work equally well and will not
change the meaning of the resulting solution so long as one considers
only contrasts in the parameters. Given this freedom of choice,
we prefer to use the constraints

$$\sum_i n_{i.}\,\alpha_i = \sum_j n_{.j}\,\beta_j = 0, \tag{6}$$

which are those commonly employed in the nonorthogonal ANOVA
(see for example Winer, 1971).
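The freedom in choosing constraints can be illustrated numerically. The sketch below starts from one arbitrary solution for the classroom example (Region I phonic taken as the reference cell) and re-expresses it under the equal-weight constraints and under the frequency-weighted constraints. The helper `recenter` and the starting values are illustrative assumptions, not part of the report.

```python
def recenter(mu, a, b, wa, wb):
    """Re-express (mu, a, b) so that sum(wa*a) = sum(wb*b) = 0.

    The fitted values mu + a[i] + b[j] are left unchanged.
    """
    ca = sum(w * x for w, x in zip(wa, a)) / sum(wa)
    cb = sum(w * x for w, x in zip(wb, b)) / sum(wb)
    return mu + ca + cb, [x - ca for x in a], [x - cb for x in b]


# One solution for the example data, with Region I / Phonic as reference.
mu0, a0, b0 = 0.70, [0.0, -0.10, -0.20], [0.0, -0.20]

# Equal-weight (sum-to-zero) constraints.
mu_u, a_u, b_u = recenter(mu0, a0, b0, [1, 1, 1], [1, 1])

# Frequency-weighted constraints with row totals (200, 200, 200)
# and column totals (150, 450) from Table 1.
mu_w, a_w, b_w = recenter(mu0, a0, b0, [200, 200, 200], [150, 450])

print(round(mu_u, 4), [round(x, 4) for x in b_u])   # general effect and method effects, equal weights
print(round(mu_w, 4), [round(x, 4) for x in b_w])   # the same, frequency weights
```

The individual parameter values change with the constraints (the general effect is .50 under equal weights and .45 under frequency weights), but the contrast between the two methods is .20 in both cases.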

We now see that equation set (5) is exactly the same as
the balancing equation set (2). If we further equate

$$\hat\mu = \hat p_{..}, \qquad \hat\alpha_i = \hat p_{i.} - \hat p_{..}, \qquad \hat\beta_j = \hat p_{.j} - \hat p_{..},$$

equation (6) is exactly the same as the balancing equation (3).
Our basic ANOVA model (4) may then be written as

$$\hat p_{ij} = \hat p_{..} + (\hat p_{i.} - \hat p_{..}) + (\hat p_{.j} - \hat p_{..}), \tag{7}$$

so that we have an exact equivalence between (7) and (1) and
hence between balancing and nonorthogonal ANOVA.

Substituting (6) and (4) in (5) we can also show that

$$\hat\mu = \frac{\sum_i \sum_j n_{ij}\, p_{ij}}{n_{..}} = \hat p_{..},$$

as is assumed in the NAEP Report. Thus, we see that the
balancing equations are but a special case of the least squares
equations of a nonorthogonal ANOVA in a main effects model,
and, in this sense, the two are equivalent.
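The substitution that yields the overall proportion is not written out in the text. A brief sketch of the omitted algebra, starting from the normal equation obtained by differentiating S with respect to the general effect and using the frequency-weighted constraints on the row and column effects, is:

```latex
\begin{aligned}
0 &= \sum_i \sum_j n_{ij}\,\bigl(p_{ij} - \hat\mu - \hat\alpha_i - \hat\beta_j\bigr) \\
  &= \sum_i \sum_j n_{ij}\,p_{ij} \;-\; n_{..}\,\hat\mu
     \;-\; \sum_i n_{i.}\,\hat\alpha_i \;-\; \sum_j n_{.j}\,\hat\beta_j \\
  &= \sum_i \sum_j n_{ij}\,p_{ij} \;-\; n_{..}\,\hat\mu ,
\end{aligned}
\qquad\text{so that}\qquad
\hat\mu = \frac{\sum_i \sum_j n_{ij}\,p_{ij}}{n_{..}} .
```

The last step uses only the two weighted zero-sum constraints; under the equal-weight constraints the two sums would not in general vanish, and the general effect would no longer equal the overall observed proportion.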

The correspondence between the balancing algorithm and that
of the nonorthogonal analysis of variance makes possible the use
of standard ANOVA programs which properly analyze nonorthogonal
designs (e.g., Cramer, 1967) for obtaining balanced fits. Since
current usage of the balancing technique has been limited to
obtaining estimated cell means and contrasts in main effect
parameters, there is no particular concern with the constraining
system employed since these solutions are invariant with respect
to the constraining system. If, however, one wishes to obtain
the estimates of the parameters themselves it would be
necessary to employ an ANOVA program which allows for the
specification of the constraints given in (6). Cramer's (1967)
program, for one, allows such a specification.
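As a check on this correspondence, the sketch below fits a main effects model by weighted least squares with ordinary reference-cell dummy coding and then verifies that the fitted successes reproduce the observed row and column totals, which is exactly what balancing requires. It is a minimal stand-in written for illustration, not Cramer's program; the success counts are hypothetical and deliberately non-additive, and exact fractions are used so the equalities hold without rounding.

```python
from fractions import Fraction as F


def solve(A, b):
    """Gauss-Jordan elimination with exact fractions (A square, nonsingular)."""
    m = [row[:] + [rhs] for row, rhs in zip(A, b)]
    size = len(m)
    for c in range(size):
        piv = next(r for r in range(c, size) if m[r][c] != 0)
        m[c], m[piv] = m[piv], m[c]
        m[c] = [v / m[c][c] for v in m[c]]
        for r in range(size):
            if r != c and m[r][c] != 0:
                factor = m[r][c]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[c])]
    return [row[-1] for row in m]


def main_effects_fit(n, k):
    """Weighted least squares fit of cell proportions to mu + a_i + b_j."""
    I, J = len(n), len(n[0])

    def x(i, j):
        # Intercept, dummies for rows 2..I, dummies for columns 2..J.
        return [1] + [int(i == r) for r in range(1, I)] + [int(j == c) for c in range(1, J)]

    P = I + J - 1
    A = [[F(0)] * P for _ in range(P)]
    rhs = [F(0)] * P
    for i in range(I):
        for j in range(J):
            xi = x(i, j)
            for u in range(P):
                rhs[u] += xi[u] * k[i][j]          # n_ij * p_ij = k_ij
                for v in range(P):
                    A[u][v] += n[i][j] * xi[u] * xi[v]
    theta = solve(A, rhs)
    return [[sum(t * xv for t, xv in zip(theta, x(i, j))) for j in range(J)]
            for i in range(I)]


n = [[100, 100], [50, 150], [0, 200]]   # Table 1
k = [[75, 45], [30, 60], [0, 60]]       # hypothetical, non-additive successes
fitted = main_effects_fit(n, k)

row_ok = all(sum(n[i][j] * fitted[i][j] for j in range(2)) == sum(k[i]) for i in range(3))
col_ok = all(sum(n[i][j] * fitted[i][j] for i in range(3)) == sum(k[i][j] for i in range(3))
             for j in range(2))
print(row_ok, col_ok)
```

The dummy coding used here imposes a different constraint system from (6), yet the fitted cell values, and therefore the balancing conditions, are unaffected, as the text states.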

Generalization to Higher Order Classifications and to Data
Other Than Proportions

It can be shown that the generalization of balancing to more
than two classifications is equivalent to estimation in a higher
order nonorthogonal ANOVA with a main effects additive model.
Thus, it is possible to produce estimates of effects balanced for
more than one interfering variable. Furthermore, there is no
need to restrict estimates to those of proportions. Since
balancing does not uniquely require data in the form of proportions
(although it is nearly always so illustrated), one could equally
well use the cell means of continuous response data in order to
obtain balanced estimates of differential effects.

The Interpretation and Meaning of Balanced Estimates

When dealing with the nonorthogonal analysis of variance
(of which balancing is just a special case) careful attention
must be given to the meaning and interpretation of estimates and
tests of significance. Appelbaum and Cramer (1974) have discussed
the problems involved in tests of significance at some length.
The critical problem in the nonorthogonal case is that the
effects of the several states of nature upon the dependent
variable in general cannot be estimated or tested separately;
they are inherently confounded. The exact manner in which such
data are treated has very profound effects upon the meaning
of the resulting estimates and the ways in which they may be
interpreted.

A thorough understanding of the nature of the estimation
procedure used in the nonorthogonal analysis of variance (and
hence balancing) may be best facilitated from a consideration of
marginal means rather than of the estimates themselves.
The parameters estimated in ANOVA (the effects) are defined as
functions of certain population means. It is clear that all the
information available for the estimates of effects is included
in the estimates of the marginal means. Recalling that the $p_{ij}$
are themselves cell means, the differences between effects which are
of particular interest can also be expressed as differences
in marginal means. For instance, if we were interested in the
differences between effects of the first two levels of the A
classification, we would be interested in

$$(\alpha_1 - \alpha_2) = (\mu_{1.} - \mu_{..}) - (\mu_{2.} - \mu_{..}) = (\mu_{1.} - \mu_{2.}).$$

In the process of selecting the way in which we produce the
estimates of these differences, we are actually making
two quite different (and to some extent independent) decisions.
One is fundamentally a question of what it is that we wish to
estimate; the second is a decision of how to estimate that which
we have decided to estimate. The first is a question of weighting;
the second is a question of models and adjustments.

When one does an experimental study, be it a true experimental
manipulation or a survey, one considers that each cell of the
design represents a random sample from some conceptual population.
In a two way classification, the true population mean of one such
conceptual population would be represented as $\mu_{ij}$. It is these
and only these basic populations which have an invariant meaning
defined by the basic design of the experiment. When we begin
to introduce the concept of a marginal mean (as we must when we
talk of effects) we are adding a new conceptual dimension, for
marginal means are weighted linear combinations of the basic
population means. The way we choose to weight the population means
in effect defines the marginal populations from which the estimates
will be obtained. It must be understood that (1) marginal
populations have no reality beyond the nature of the basic
populations and the way in which they are combined, and (2) that
the meaning of the estimated effects will depend upon what weights
are selected (i.e., the weights will determine what is being
estimated).

A weighted mean is any linear combination of observations with, 
positive coefficients which sum to one. There are, of course, 
many different sets of coefficients with this property, implying 
that there are many different conceptual marginal populations 
which could be defined. There are, however, three basic types 
of wiilghtings which might be employed for a two way design: 
(1) equal weights, (2) singly subscripted weights, and (3) doubly 
subscripted weights. In order to understand the nature of these 
three weighting schemes, consider for the moment the situation 
in which we know the estimated population means for each and 

every cell in a two way design. Consider the construction of row- 
marginal means with each of the three weighting schemes. 
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In the first case, all weights employed would be equal so 
that the marginal population mean for the ith row of the experi- 
ment would be 



This type of marginal mean Is usually referred to as an unweighted 
marginal mean. In this case, each of the basic populatl.ons 
is treated as beinp identical to all other populations in its 
contribution to the marginal populations* In the second case 
the weights carry only a single subscript yielding row means of 
the form 



The several basic populations entering into the row marginal mean 
are differentially weighted, but the weights are the same for 
every row. These marginal means will, in general, be different 
from the unweighted means. For the third case the weights for 
each row will sum to one but they will differ from row to row. 
In this case the marginal mean for the ith row will be 

\[ \mu_{i\cdot} = \sum_{j} w_{ij} \mu_{ij}, \qquad \sum_{j} w_{ij} = 1 \text{ for each } i. \]



A question which must concern us is "for what situation will we 
want to use which of the various weighting schemes?" If in the 
example considered we were interested in estimating the differences
between the two reading methods as they are used throughout the
country, we would be interested in differential effects based
upon weighted marginal means, where the weights reflect the number
of classes using a particular reading method in a particular region
of the country (a doubly subscripted weighting system). If, on
the other hand, we were interested in estimating the differences
between the two reading methods as if they were equally used
throughout the country, we would be interested in differential
effects based upon equal weights. A third possibility would be
to assume that the use of both methods was in proportion to the 
population in the various regions. This would imply that the 
same weights would be used for both methods, i.e., singly- 
subscripted weights. The choice of the weighting system is 
entirely up to the investigator, but the choice is not a trivial 
one. The selection of the weighting system basically defines 
what it is that the researcher is referring to. One further
refinement on the nature of the weights, in the case of balancing,
will be added shortly.
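
The three weighting schemes can be illustrated with a small numerical sketch. The cell means and weights below are invented for illustration (rows stand for the two reading methods, columns for three regions); none of the numbers come from the report, and the function name is hypothetical.

```python
# Hypothetical cell means: rows = reading methods, columns = regions.
cell_means = [
    [60.0, 55.0, 50.0],   # method 1 in regions 1..3
    [58.0, 57.0, 44.0],   # method 2 in regions 1..3
]

def row_marginals(means, weights):
    """Weighted row marginal means; weights[i] must sum to one for each row i."""
    out = []
    for mu_row, w_row in zip(means, weights):
        assert abs(sum(w_row) - 1.0) < 1e-9, "weights in each row must sum to one"
        out.append(sum(w * mu for w, mu in zip(w_row, mu_row)))
    return out

n_cols = len(cell_means[0])

# (1) Equal weights: every cell in a row counts the same.
equal = [[1.0 / n_cols] * n_cols for _ in cell_means]

# (2) Singly subscripted weights w_j: the same set for every row
#     (here, assumed regional population shares).
region_share = [0.5, 0.3, 0.2]
single = [region_share for _ in cell_means]

# (3) Doubly subscripted weights w_ij: a different set for each row
#     (here, an assumed share of each method's classes in each region).
double = [
    [0.7, 0.2, 0.1],
    [0.2, 0.3, 0.5],
]

m_equal = row_marginals(cell_means, equal)
m_single = row_marginals(cell_means, single)
m_double = row_marginals(cell_means, double)

for name, m in [("equal", m_equal), ("single", m_single), ("double", m_double)]:
    print(f"{name:>6}: row means = {m[0]:.2f}, {m[1]:.2f}; difference = {m[0] - m[1]:.2f}")
```

The same cell means give three different method differences, which is the sense in which the choice of weights determines what is being estimated.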

The Nature of the Weights 
Up to this point, nothing has been said about the nature 
of the weights themselves. In practice the weights may represent
any conceptual entity which the researcher deems important, say
the relative cost of a treatment, the current social importance
of a particular segment of the population, etc. Surely the most
common weights by far are the relative sample sizes. Insofar
as the observed cell frequencies represent (are proportional to)
the actual population sizes, weighting by the cell frequencies
may be logically sound. In those cases where the observed cell
frequencies do not reflect any true state of nature, or when the
populations are considered to be infinite, such a weighting scheme
can make little if any sense.

At this point a word of caution seems to be in order.
It is important to distinguish between weights as we have defined
them above and the coefficients of the "normal equations" in (5)
and in the constraints of (6). The weights are defined for the
purpose of constructing marginal populations; the coefficients
of (5) and (6) are the result of the Least Squares criterion and
are completely independent of considerations relating to the
definition of marginal populations.

The Problem of Estimation 

Having decided upon a weighting scheme, and thereby defined
marginal populations and a potential set of effects to be
estimated, one is left with a second, although not totally
independent, question of how to do the estimation. Clearly, if
we possess unbiased estimates of the individual population means
we can easily obtain unbiased estimates of the marginal means no
matter how they are defined. Since linear combinations of unbiased
estimates of population means produce unbiased estimates of the
same linear combination of population values, we may always
obtain the desired unbiased estimates. Thus, the problem of
estimation reduces to the problem of how to produce unbiased
estimates of the individual population means.
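
The unbiasedness argument is simply linearity of expectation. As a sketch, in notation assumed here rather than taken from the report ($\hat{\mu}_{ij}$ for an unbiased cell estimate, $w_{ij}$ for fixed weights):

```latex
\[
E\Big[\sum_{j} w_{ij}\,\hat{\mu}_{ij}\Big]
  = \sum_{j} w_{ij}\,E[\hat{\mu}_{ij}]
  = \sum_{j} w_{ij}\,\mu_{ij}
  \qquad\text{whenever } E[\hat{\mu}_{ij}] = \mu_{ij}.
\]
```

The identity relies on the weights being fixed in advance; if the weights are themselves estimated from the same data, it need not hold exactly.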

Whenever one establishes estimates of parameters, say
population means, one is always operating within the context
of a model; the nature of the obtained estimate depending upon
the model in which it is estimated. In the two way classification
scheme there are five reasonable, but different, models which might
be used to estimate a typical population mean $\mu_{ij}$. These are:

  i)   $\mu_{ij} = \mu + \alpha_i + \beta_j + \alpha\beta_{ij}$   (the interaction or cell means model)

  ii)  $\mu_{ij} = \mu + \alpha_i + \beta_j$   (the two main effects model)

  iii) $\mu_{ij} = \mu + \alpha_i$   (the main effects A model)

  iv)  $\mu_{ij} = \mu + \beta_j$   (the main effects B model)

  v)   $\mu_{ij} = \mu$   (the grand mean model).

When an experimental design is nonorthogonal and when the Least
Squares estimation procedure is used for obtaining the estimates,
very different estimates of the population means will obtain for
estimation in the different models and, as a consequence,
different estimates of the marginal means and differential
effects will result.

If we choose to estimate the individual population means in
the first model (the interactive or cell means model), the ordinary
cell mean, $\bar{y}_{ij}$, is obtained as the estimate. $\bar{y}_{ij}$ is always an
unbiased estimator of the population mean $\mu_{ij}$ without regard to
which model obtains in nature. If, however, one of the simpler
models should be the true model, the variance of the $\bar{y}_{ij}$'s will
be larger than the variance of the unbiased estimator resulting
from estimation in the correct model. Thus, $\bar{y}_{ij}$, while always
providing an unbiased estimator, will not be the minimum variance
unbiased estimator unless there is truly an interaction between
the classification dimensions.

4. Some authors have suggested other possible models. Problems
involved with such models have been discussed elsewhere
(e.g. Appelbaum & Cramer, 1974) and are not considered here.

When the estimation proceeds from the second model (the two
main effects model), one obtains the balanced fit
estimates of the population means (or equivalently the main
effects ANOVA estimates). These estimates are unbiased only when
one of the non-interaction models (ii, iii, iv, or v) holds in
nature and will be minimum variance unbiased only when model ii
holds. Thus, estimates from model ii, often called adjusted
estimates, are appropriate only in the non-interactive case.

In a similar fashion, estimates based on models iii, iv,
and v will be unbiased estimates only when the corresponding
models obtain. These estimates provide minimum variance unbiased
estimators only when the particular model holds.

One is free to select an estimation scheme based upon one's
belief in the state of nature, but one must always remember
that this choice will simultaneously affect the resulting
estimates both in terms of their unbiasedness and variance. In
order to obtain unbiased minimum variance estimates one must
estimate in the model corresponding to the true state of nature.
Should the model selected be too simple relative to the true
state of nature, the estimates will, in general, be biased;
should the model be too complex, the estimates will not be
minimum variance.

The Intersection of Weighting Schemes and Estimation

Any procedure for obtaining estimates of effects in an n-way
layout can be viewed as the intersection of a weighting
scheme and an estimation procedure, and its properties may be
better understood by examining the consequences of the individual
components. Balancing is, in this view, the intersection of a
singly subscripted weighting system with model ii estimation
(two main effects model). Thus, the balanced estimates of
differential effects are estimates obtained employing singly
subscripted weights for each population and by assuming no
interaction between the classification dimensions.

It is, however, possible to view balancing as the intersection
of equal weights and model ii estimates. This indeterminacy
occurs because of an interaction between the weighting
system and estimation system employed. This result, which has
major implications for the interpretation of the balanced fit,
may be understood more easily by returning to our initial example.

The NAEP investigators discuss balancing in terms of making
comparisons between two groups as if the groups were identical to
one another in terms of their compositions on other (interfering)
variables. This goal clearly implies a singly subscripted
weighting system. In terms of our initial example, this amounts
to asking about the differences in reading method as if they were
used in the same proportion in all three regions of the country.
Thus we would be interested in the column marginal difference
with the rows weighted the same for both columns; i.e. we are
interested in

\[ \mu_{\cdot 1} = w_1\mu_{11} + w_2\mu_{21} + w_3\mu_{31} \]

and

\[ \mu_{\cdot 2} = w_1\mu_{12} + w_2\mu_{22} + w_3\mu_{32}. \]

The $w_i$'s are the weights to be applied to each region, and may,
for instance, reflect the relative sizes of the regions in
terms of the number of first grade classrooms. We may now write

\[
\mu_{\cdot 1} - \mu_{\cdot 2}
  = (w_1\mu_{11} + w_2\mu_{21} + w_3\mu_{31}) - (w_1\mu_{12} + w_2\mu_{22} + w_3\mu_{32})
  = w_1(\mu_{11} - \mu_{12}) + w_2(\mu_{21} - \mu_{22}) + w_3(\mu_{31} - \mu_{32}).
\]

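The implication of the noninteractive model employed by the
balancing system, however, is that all row differences must be
equal across columns and that all column differences must be
equal across rows:

\[ \mu_{11} - \mu_{12} = \mu_{21} - \mu_{22} = \mu_{31} - \mu_{32}. \]

We further note that the weights must be chosen to sum to 1 by
definition, so that

\[ \mu_{\cdot 1} - \mu_{\cdot 2} = (w_1 + w_2 + w_3)(\mu_{11} - \mu_{12}) = \mu_{11} - \mu_{12}. \]
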
The implication of this result is that it makes
absolutely no difference how the subpopulations are weighted in
constructing the marginal population in the balanced solution as
long as they are weighted the same for each population. This
implies that the true relative sizes of, say, regions of the
country do not enter into the assessment of the methods difference.
Since equal weights are but a special case of singly subscripted
weights they could equally well be used. We therefore conclude
that in balancing it makes not the least bit of difference
whether we equally weight the subgroups or weight them
differentially in the sense of singly subscripted weights.
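
The invariance can be checked numerically. The sketch below uses made-up additive cell means for three regions and two methods (the effect sizes and weight vectors are assumptions, not data from the report). It shows that every set of singly subscripted weights yields the same column difference, whereas adding an interaction term breaks the invariance.

```python
def column_difference(mu, w):
    """mu[i][j]: mean for region i, method j; w[i]: region weights summing to one."""
    assert abs(sum(w) - 1.0) < 1e-9
    col1 = sum(wi * row[0] for wi, row in zip(w, mu))
    col2 = sum(wi * row[1] for wi, row in zip(w, mu))
    return col1 - col2

# Additive (no-interaction) cell means: mu_ij = grand + region_i + method_j.
grand, region, method = 50.0, [4.0, 0.0, -4.0], [3.0, -3.0]
additive = [[grand + r + m for m in method] for r in region]

# Same table with an interaction added in one cell of region 3.
interactive = [row[:] for row in additive]
interactive[2][0] += 6.0

# Three hypothetical sets of singly subscripted weights.
weight_sets = {
    "equal": [1 / 3, 1 / 3, 1 / 3],
    "by_size": [0.5, 0.3, 0.2],
    "lopsided": [0.1, 0.1, 0.8],
}

additive_diffs = {k: column_difference(additive, w) for k, w in weight_sets.items()}
interactive_diffs = {k: column_difference(interactive, w) for k, w in weight_sets.items()}

for k in weight_sets:
    print(f"{k:>8}: additive {additive_diffs[k]:.2f}   with interaction {interactive_diffs[k]:.2f}")
```

Under the additive table the weights drop out entirely; once an interaction is present, the answer depends on the weights, which is why the tenability of the main effects model matters.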

The ' tradlt ional fem; inco.neiic analysis" prtast^ated as the 
filTsrz. example In this prap^er I5 ccm arxample of an — ighted 

with model iii estimates obtained for the rows and model
iv estimates obtained for the columns. The incorrectness of this
analysis arises from the fact that we are applying a one main
effect model when indeed there are two main effects. The row
marginal means are obtained from a model $\mu_{ij} = \mu + \alpha_i$, where
$\mu + \alpha_i = \sum_j n_{ij}\mu_{ij} / n_{i\cdot}$, while the column marginal means are obtained from
a model $\mu_{ij} = \mu + \beta_j$, where $\mu + \beta_j = \sum_i n_{ij}\mu_{ij} / n_{\cdot j}$.

Conclusion 

It has been shown that the estimation procedure
employed in balancing is nothing more or less than the Least
Squares estimation of effects in a nonorthogonal main effects
model analysis of variance. In assessing the appropriateness
of this method of analysis for a particular study, one must
consider the appropriateness of the two component elements: first
that of the weighting scheme employed and second that of the use
of the main effects model.

Balancing, it has been seen, can be viewed as employing
either singly subscripted weights or equal weights, the results
being invariant to this selection. It should be noted that these
are not the only possible schemes, nor the ones necessarily
desired. One could alternatively use the cell estimates obtained
from the solution of the balancing equations, but then use unequal
weights to define marginal population means. The selection is up
to the researcher and depends only upon what it is that he wishes
to estimate.

The assessment of the use of the main effects model is a
somewhat more difficult one due to the fact that the
appropriateness of the model depends upon what is true in nature, not
simply upon what we would like to be true. In many ANOVA
applications this is not a particular problem, for one often
tests the significance of the interaction prior to estimation in
order to determine what the proper model is. In using balancing,
however, one is at the outset assuming that the main effects
model is appropriate. The basic implication of this assumption
is that we are assuming no differential effect of one state of
nature conditional upon another. In our example we are, for
instance, assuming that the difference between the efficiency
of phonic and sight methods of reading instruction is the same
in each and every region under study. The tenability of the
non-interaction assumption is, of course, completely dependent
upon the particular study under consideration and no general
rules can be formed for saying a priori when the assumption holds.
There may be certain circumstances under which the additive model
is appropriate, but it would seem, in general, to be a dangerous
assumption to routinely employ.

Chapter III: The Nonorthogonal ANOVA

Central to the understanding of balancing is the concept of the
nonorthogonal ANOVA. The following primer serves to illuminate the fundamental
concepts of the nonorthogonal analysis and to resolve a number of the
controversies surrounding this general topic.

The nonorthogonal, multifactor analysis of variance (ANOVA) is, perhaps,
the single most misunderstood analytic technique available to the behavioral
scientist save factor analysis. Standard textbooks all but ignore it, or,
when they do consider this case, bury it in such confused mathematics or
approximations as to make it barely understandable to even rather statistically
sophisticated researchers. Recent journal articles (e.g., Joe, 1971; Overall
and Spiegel, 1969; Rawlings, 1972; Werts and Linn, 1971; and Williams, 1972)
have attempted to clarify the situation and set guidelines for the analysis
of nonorthogonal, multifactor experiments, but have, in our opinion, done
neither. These papers have, again, confused the issues with unnecessary
mathematical proofs, with antiquated "approximate" methods, and with the
implication that somehow nonorthogonal designs are special cases to be
avoided at all costs. So strong is the belief that there is something
inherently "difficult" or "strange" about the nonorthogonal case that
experimenters will, on occasion, go to unusual lengths, such as randomly discarding
data from selected cells, to achieve an orthogonal design.

We wish to argue that there is no conceptual difference between orthogonal
and non-orthogonal ANOVA and that, indeed, the orthogonal design is a special,
and occasionally artificial, case of the more general non-orthogonal design. By
approaching the entire issue of the analysis of variance as one of model
comparisons, the special problems encountered in the non-orthogonal case are rather
easily understood and resolved. The closely related problem of deletion of
variables in multiple regression analysis has been discussed by one of the
authors (Cramer, 1972). By treating the problem as one of comparisons of linear
models he has resolved the issue in a clear and fairly obvious manner. We believe
that a similar approach with non-orthogonal designs will lead to the same resolution.

The easy access to sophisticated computer programs which perform the
analysis of variance by a general linear model approach (e.g., MANOVA; Cramer,
1967) makes the computations for this method of dealing with non-orthogonal,
multifactor designs possible and eliminates, in most cases, the need and
desirability for "approximate" solutions.

Terminology and basic concepts

Before proceeding with a detailed discussion of non-orthogonal analysis of
variance, it is necessary to clarify some of the terminology and concepts that
are fundamental to these analytic techniques. A non-orthogonal design refers
to any experimental design in which the numbers of observations are not equal
in each and every cell. This definition encompasses even designs that are
traditionally classified as proportional and includes designs that are not
complete factorial. Insofar as an experimental design may be considered a
partially complete factorial design (e.g., Randomised Blocks, Latin Squares,
nested or hierarchical designs, etc.), the principles discussed in this paper
apply.

We shall use the term method to refer to the estimation procedure, which
we shall assume to be the Method of Least Squares. The concepts developed and
discussed in this paper apply only to Least Squares Analyses and should not
be applied to non-exact approaches such as the Unweighted Means Analysis (Winer,
1971; 445-449) nor to cases which employ some other method of estimation of
effects.

By an experimental design we shall mean the plan of the experiment determined
by the experimenter on the basis of his conception of some idealized state of
nature. The minimum requirements for an experimental design are the specification
of the experimental factors to be manipulated and the plan for random assignment
of experimental units to treatments, including both the sampling plan and the
determination of the number of units per treatment. It is the experimental design
which implicitly specifies a set of possible models or idealized states whose
appropriateness we shall attempt to assess.

Hypotheses or tests of hypotheses are, in essence, comparisons of various
models. It is fundamental that one understand that, within the analysis of
variance, one is always trying to assess the appropriateness of one model
in comparison to another one. To stress this point, we shall often refer to
significance tests as model comparisons. Unfortunately, the standard approaches
to the analysis of variance in most introductory courses overlook this
consideration and have led to much unnecessary confusion.

Finally, one must carefully consider those situations which may produce
a non-orthogonal design. First there is the case where the design is
intentionally planned as non-orthogonal and is executed as planned. Such designs
are reasonable and may be preferred in cases where contrasts of particular
cells are desired, or where greater precision of estimation is required in
some cells than in others. Similarly, some experiments, particularly those
involving concomitant variables as factors in the design, may be planned
as non-orthogonal in order to allow naturally occurring differences in cell
frequencies to manifest their effects in the resulting tests. While these
designs are rarely encountered in psychological studies, they do have
applications and present no particular difficulty in terms of "proper" analysis and
interpretation. The discussion of the non-orthogonal analysis of variance which
follows is directly applicable to those designs. The second, and far more
common, case occurs when a design (orthogonal or non-orthogonal) is not
executed as planned. That is, once the random assignment of experimental units
to treatments has been made, data are not obtained on some units. In this
second case, one of two different situations may have occurred and, depending
upon which is true, rather different approaches are required. It may be that
the "cause" of the loss of experimental units is a random phenomenon or
one unrelated to the experimental treatments. Death of experimental animals,
"no-show" of college subjects, etc. often may be viewed as essentially
random phenomena. Again we have no particular problem for we are, in effect,
left with a random sample of a random sample which is itself a random sample,
and the methods of non-orthogonal analysis of variance to be discussed still
apply.

The situation that may cause considerable difficulty is when the "cause"
of loss of experimental units cannot be considered a random phenomenon (e.g.,
it may be related to the experimental treatment). This situation may be
obvious, as when the combination of treatments causes the death of some
experimental units; or it may be subtle, as when one set of treatment combinations
is run late in the afternoon, causing an increase in the no-show rate. In
such a case, there would seem to be no remedy short of pretending that the
missing observations are random, and hoping that the results will be
reasonable. Perhaps the definitive statement was made by Cochran and Cox (1957,
p. 82) when they observed that the only complete solution to the problem
of missing data is not to have any. The following method leads to correct
analyses and interpretation of designs which (1) are planned as non-orthogonal
or (2) become non-orthogonal due to the random processes of nature.

Models and the method of least squares

Having decided, either by choice or default, to employ the method of least
squares and having determined the design of the experiment on the basis of a
belief as to the nature of the world (in some idealized sense), one is left
only with the selection of possible models and model comparisons. We first
note that the models selected are logically independent of the observed
numbers of observations per cell. While obviously the analysis will be
affected by the cell frequencies, the experimenter is free in the design of
the experiment to choose the numbers of observations per cell, constrained
only by considerations of efficiency and convenience. The models themselves,
the representations of our belief in the nature of the world, are not expressed
in terms of the number of units in any subpopulation; indeed, the models being
considered are usually in terms of infinite subpopulations. Since the model
itself is free of population size, the cell frequencies can hardly matter
in terms of the correctness of the model. But then why all of the concern
about non-orthogonal analysis?

The problem of non-orthogonal analysis really occurs at the level of
model comparisons and proper interpretation of the results of such comparisons.
As we shall see, the difficulty arises from the methods available to assess
the "correctness" of the several models being compared.

One-way ANOVA

Let us first consider a k-group one-way ANOVA with $n_j$ observations in each
group; a design which is usually thought to offer no problems, even with unequal
cell frequencies. It is our goal to make inferences concerning the population
means in the several treatment populations. These inferences will be based
upon the observed sample means, the best unbiased estimates of the population
means with no additional assumptions about the populations beyond those
of normality and homogeneity of error variance.

One-way ANOVA is commonly treated as the comparison of two models

    (I)   $Y_{ij} = \mu + \alpha_j + e_{ij}$

    (II)  $Y_{ij} = \mu + e_{ij}$

If model II is the correct model, the means for the several populations must
be exactly the same and the best unbiased estimate (the least squares estimate) of
each of them is the common mean of all the observations. This estimate has
variance $\sigma^2 / \sum_j n_j$, where $\sigma^2$ is estimated by the common within-cell variance.
If model I is correct, the best unbiased estimate of any population mean is the
sample mean for that population, which has variance $\sigma^2 / n_j$. The best unbiased
estimate of any difference (contrast) in the population means is the difference
(contrast) in the sample means. This is true regardless of the number of
observations obtained from any population since knowing one population mean
tells one nothing about any other population mean. (Note that if Model II
is correct, estimates of means obtained from Model I are unbiased but have
variances which are larger in the ratio $\sum_k n_k / n_j$.)


The number of observations obviously does affect the variance of the
estimates of population means and must also affect the power of any tests
of significance. For any two groups (j and k) the variance of the mean
difference is the weighted sum of the variances (of means), $\sigma^2/n_j + \sigma^2/n_k$. If
the total number of observations for the two groups is held constant, this
variance is a minimum when the cell frequencies are equal and the power of
the test of the difference of these population means will be a maximum in
this case. Similarly it can be shown that the power of the test of equality
of all the population means is a maximum when all the cell frequencies are
equal. Thus the effect of non-orthogonality in the one-way ANOVA is in terms
of the power of the test, not in the obtained estimates nor in the test of
their significance.
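
A short numerical check of the equal-allocation claim follows. The values used (a common variance of 225 and a total of 54 observations split between two groups) are assumptions chosen only for illustration.

```python
sigma2 = 225.0   # assumed common within-group variance
total = 54       # assumed total number of observations across the two groups

def var_of_difference(n_j, n_k, sigma2=sigma2):
    """Variance of the difference between two independent sample means."""
    return sigma2 / n_j + sigma2 / n_k

# Try every split of the fixed total between the two groups.
splits = {n_j: var_of_difference(n_j, total - n_j) for n_j in range(1, total)}
best_n_j = min(splits, key=splits.get)

print("best split:", best_n_j, total - best_n_j, "variance:", round(splits[best_n_j], 3))
print("27/27 split:", round(var_of_difference(27, 27), 3))
print("2/52 split: ", round(var_of_difference(2, 52), 3))
```

The estimates themselves are unaffected by the split; only their precision, and hence the power of the test, changes.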

Two-Way Analysis of Variance

The situation is not nearly so simple when we move to the case of factorial
experiments. The additional problems encountered in the factorial case are
illustrated by the following example, intentionally constructed to represent
an extreme case.

Consider a two-way ANOVA for which we have observed the cell means $\bar{x}_{ij}$
with the cell frequencies $n_{ij}$ as given in Table 1. Assume that the estimated
within-cell (error) standard deviation is 15 in each cell (i.e., $MS_{error} = 225$).


Table 1

Cell Means and Frequencies for Two-way Example

      Cell Frequencies, n_ij            Cell Means, x_ij
                B                              B
             1      2                       1      2
    A   1   25      2            A   1     10     10
        2    2     25                2     20     20




As an exercise, let the reader consider the answers to the following 
questions before proceeding further: (1) What can one say, given the above 
information, about the presence of any main effects or interactions in this 
experiment? (2) Given the answer to this question, further consider what 
one would suppose to be true of the populations? 

In our experience, relatively sophisticated psychologists and graduate 
students will not necessarily answer these questions in a consistent manner. 
We believe that the customary training in psychological statistics will lead 
many to base their answers to the first question on the means alone, judging
that there is an A effect but no B effect or interaction. The obvious
inequality in the numbers of observations per cell will be troubling and the
sophisticated respondent will certainly observe, in response to the second
question, that the main diagonal means are much more stable than the
off-diagonal means.

If one asks the further question, "What would and should an ANOVA
analysis tell you about the true populations?", we are at the heart of the
problem of non-orthogonal ANOVA. Surely any ANOVA, orthogonal or not, must
give information about the population means. It is not reasonable that
unequal numbers of observations in cells will alter the character of this
information, although it will certainly alter the precision of any statements.

Looking at the sample means of Table 1 it is apparent that if the
population means are the same as these sample means (and this is our best guess),
there is only an A effect present. Our statements must, however, take into
account the sampling variability of these sample means. Consider, for a
moment, the 95% confidence intervals (Table 2) which might be generated about
the four observed sample means. Since the samples themselves are independent
random samples from four possibly different populations, the confidence
intervals are, in the same sense, independent. From these confidence intervals

Table 2

95% Confidence Intervals on Sample Means (a)

                              B
                   1                            2
    A   1    3.97 < $\mu_{11}$ < 16.03    -11.32 < $\mu_{12}$ < 31.32
        2   -1.32 < $\mu_{21}$ < 41.32     13.97 < $\mu_{22}$ < 26.03

(a) These confidence intervals are based upon the pooled MS error with 50 d.f.

we may see it is reasonable that our sample could have come from a set of
populations with any of the following patterns of means (Table 3). These are but



Table 3

"Reasonable" Population Means

            B                    B                    B
          1    2               1    2               1    2
    A  1  10   10        A  1  10   20        A  1  10   20
       2  20   20           2  10   20           2  30   20


three among many possible sets, but notice that we would consider the first as
one in which there was a main effect of A, but no B or AB effects; the second
would be considered as an example of a main effect for B but no A or AB effect;
while the third would be indicative of a situation with interaction and a main
effect.
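
As a check on the intervals in Table 2, the sketch below recomputes them and tests the three patterns of Table 3 against them. The cell frequencies are taken as 25 on the main diagonal and 2 off it, which is what the reported interval widths imply. The critical value t = 2.01 for 50 d.f. is hard-coded as an assumption, since the report does not state the value it used.

```python
import math

ms_error = 225.0      # pooled within-cell mean square from the example
t_crit = 2.01         # assumed two-sided 95% critical value of t with 50 d.f.

# cell label -> (sample mean, cell frequency); frequencies are inferred
# from the interval widths reported in Table 2.
cells = {
    "11": (10.0, 25),
    "12": (10.0, 2),
    "21": (20.0, 2),
    "22": (20.0, 25),
}

def confidence_interval(mean, n):
    """Interval mean +/- t * sqrt(MS_error / n), rounded to two decimals."""
    half_width = t_crit * math.sqrt(ms_error / n)
    return round(mean - half_width, 2), round(mean + half_width, 2)

intervals = {label: confidence_interval(m, n) for label, (m, n) in cells.items()}

# Candidate population patterns from Table 3, keyed the same way.
patterns = {
    "A only": {"11": 10, "12": 10, "21": 20, "22": 20},
    "B only": {"11": 10, "12": 20, "21": 10, "22": 20},
    "interaction": {"11": 10, "12": 20, "21": 30, "22": 20},
}

def is_reasonable(pattern):
    """True when every hypothesised mean falls inside its cell's interval."""
    return all(intervals[c][0] < mu < intervals[c][1] for c, mu in pattern.items())

for label, (low, high) in intervals.items():
    print(f"mu_{label}: {low} to {high}")
for name, pattern in patterns.items():
    print(name, "->", is_reasonable(pattern))
```

The wide intervals for the two small cells are what allow such different patterns of population means to remain compatible with the same data.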

We thus believe that the conclusion one should logically draw from these 
sample values is that there are some effects, but that the data do not per- 
mit a definitive statement as to which. We further believe that a proper 
ANOVA should lead one to draw such conclusions.

An incorrect "approximate" analysis

Let us now consider an erroneous analysis which we believe many
psychologists might be inclined to perform. This is an analysis of each factor
collapsing over the other. Although it does have some intuitive appeal and
indeed may be useful in conjunction with other analyses, it will, in general,
lead to incorrect conclusions about the population means when used alone.

Suppose we collapse the design given in Table 1 over the B classification,
leaving us with two levels of A with mean values of 10 and 20 as shown in the
marginal values of Table 4. A one-way analysis would then lead to the
conclusion that there is a significant main effect of A (p = .017). If we then
collapse over levels of A we have the levels of B with means of 10.7 and

Table 4

Means Collapsed over Classification

                   B
               1       2
    A   1     10      10    |  10
        2     20      20    |  20
            ----------------
             10.7    19.3



19.3 suggesting a B effect (p = .042) as well as an A effect. We would call
these analyses, respectively, "A ignoring B" and "B ignoring A". The use of
the phrase "A ignoring B" is meant to indicate that in our two-way table we
"ignore" the B classification and treat the design as if it were only a
one-way classification with levels of A. Observations for a given level of A are
considered replicates regardless of whether or not they correspond to the same
level of B (that is, we assume no B or AB effects). When there is no B or AB
effect, the observations at the several levels of the collapsed factor are, in
effect, replicates since the variability between levels of B is of the same
order of magnitude as the variability within a level of B. If, however, in a
non-orthogonal design, there is a B or an AB effect, the estimated magnitude
of the A effect (ignoring B) will, in general, be affected by the number of
observations in the cells and does not represent an unbiased estimate of any
population value. Only when there are equal numbers of observations in the
cells will the estimate of the magnitude of the A effect be unaffected by the
number of observations in the presence of a B or AB effect.

In general, when we are estimating the magnitude of effects, we may safely
ignore other effects in the design only when they are null or when their
estimates are independent of the effects in which we are interested. The first
condition (that of null effects) depends only upon the state of nature; the
second (independence of estimates) depends only upon the actual design of the
experiment. The conclusions drawn from this "ignoring" analysis of Table 1 will
be incorrect under the (plausible) assumption that there is only one main
effect in the population responsible for the results.

For the general non-orthogonal case a different method is necessary in
order to estimate treatment effects without bias and to provide unbiased tests
of significance. These are tests of "A eliminating B" and "B eliminating A"
with corresponding estimates of the effects. In essence these are tests which
take into account and eliminate the confounding effects of other factors when
they are present. Thus a test of "A eliminating B" "removes" any confounding
effects of factor B. If there is no B effect (i.e., it is null in the
population) or if the design is orthogonal, there is no confounding due to B and
nothing to eliminate; hence the test will be identical to that of "A ignoring
B". The test of "A eliminating B" answers the question: given the possibility
of a B effect, is there evidence for an A effect in addition to any B effect
which might be present? On the other hand, the test of "A ignoring B" in general
answers the question: is there any evidence for an A effect assuming there is
no B effect, or ignoring it if it is present? The estimate of the A effect
corresponding to the test of "A eliminating B" is unbiased regardless of the
existence of any B effect or of orthogonality in the design. It is always
the "correct" estimate.
Model Comparisons and Tests of Effects 

The more general "eliminating" method described above involves fitting a
model allowing for both A and B effects and then comparing the fit (i.e., the
quality of the model) to the fit of a model omitting one or more of the effects.
For example, consider the following models which "predict" the response of a
subject in the ij cell of a two factor design:



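I.   $Y_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e$

II.  $Y_{ij} = \mu + \alpha_i + \beta_j + e$

III. $Y_{ij} = \mu + \beta_j + e$

IV.  $Y_{ij} = \mu + \alpha_i + e$

V.   $Y_{ij} = \mu + e$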





Model I is the most complete model for a two factor design; it allows
for an overall level ($\mu$), an effect dependent upon the level of factor A ($\alpha_i$),
an effect dependent upon the level of factor B ($\beta_j$), and an interactive effect
($\gamma_{ij}$) dependent upon the joint, non-additive effect of the combination of the
ith level of factor A with the jth level of factor B. The other models are
obtained from the first by dropping the interaction term and possibly one or both
of the main effects. Those accustomed to only orthogonal ANOVA will be
inclined to regard model I as capable of providing the parametric estimates needed
for the other models, but this is not so in general. Each model represents a
separate least squares estimation problem and may provide different estimates
of the parameters involved. Only in the case of orthogonality will the
estimated parameters for the different models be necessarily the same. Likewise,
it is only for the orthogonal case that the estimated parameters within a model
will be statistically independent (unconfounded) of one another. This is the
real meaning of orthogonality.

We would begin the analysis of a two-way factorial, either orthogonal or
non-orthogonal, with the test of interaction. Our feelings of parsimony dictate
a preference for a main effects model if it is consistent with the data and
so we would wish to compare a model allowing for main effects and interaction
(Model I) with one only allowing for main effects (Model II), that is, testing
AB eliminating A and B. In a two factor complete factorial experiment this is the
usual test of the two-way interaction which is routinely employed. If we are
able to reject the hypothesis of null interaction effects our usual procedure would
be to stop at this point with an interaction model. If, however, we are unable
to reject this hypothesis (i.e., we conclude that interaction effects are
non-significant) we would wish to proceed with tests of main effects.

When we allow the possibility of both an A and B effect in the population
we are specifying a series of tests involving model II. Thus, to test either
effect we must test it in that model, implying an alternative model in which
it is absent. To test for an A effect we compare model II to model III, while
to test for a B effect we compare model II to model IV. In each of these
tests we are allowing for the possible existence of the effect not being
tested. In testing A we are asking the question "given the possible existence
of B in our model, do we need A?" This is the meaning of the term "A
eliminating B".

Our judgment as to which model to accept is based upon the relative
magnitudes of the sum of squared errors produced by the competing models and
the F test gives a method for testing whether the models differ in this
respect. This procedure is always correct, in either the orthogonal or
non-orthogonal case. In the orthogonal case it will produce results identical
to those produced by the ordinary computational methods.
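The comparison of two nested models can be written out as a worked formula. This is a sketch in notation introduced here, not the authors': R denotes the reduced model, F the fuller model, SSE a sum of squared errors, and df the corresponding error degrees of freedom. The choice of error mean square in the denominator is an assumption.

```latex
F = \frac{\left(SSE_{R} - SSE_{F}\right) / \left(df_{R} - df_{F}\right)}{MS_{\text{error}}}
```

For the test of A eliminating B, R is model III and F is model II. The entries of Table 5 are consistent with taking the within-cells mean square as $MS_{\text{error}}$: for example, 370.37 / 225.00 is approximately 1.646.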

Different tests of A and B effects may be appropriately obtained by
beginning with different model assumptions. If we assume that there is no B
effect, model IV is an appropriate model and we would compare it to model V in
order to test the existence of an A effect in model IV (i.e., without regard
to the existence of a B effect). This test of "A ignoring B" is not a proper
test unless model IV is the correct model, i.e., unless there is no B effect.
Similarly we may test B ignoring A by comparing model III against model V, but
here the test is proper only if model III is appropriate, i.e., there is no
A effect. In the case of an orthogonal design these tests will give us the
same results as those tests involving model II, but while the results are
computationally the same (due to independence of the estimates of the parameters
involved) they are not logically the same in terms of comparing the same models.

An Example

Let us now apply this method to the data of Table 1 using the MANOVA computer
program (Cramer, 1967). We may summarize all the relevant statistical tests in
the following ANOVA table (Table 5):

Table 5

ANOVA Tables for Complete Analysis of Data in Table 1

Source              df         SS         MS        F        p
AB                   1       0.00       0.00     0.00    1.000
A eliminating B      1     370.37     370.37    1.646     .205
B eliminating A      1       0.00       0.00     0.00    1.000
A ignoring B         1    1349.99    1349.99    5.999     .017
B ignoring A         1     979.63     979.63    4.353     .042
Within Cells        50   11250.00     225.00







It may be clearly seen that there is no evidence for an interaction;
however, the small numbers of observations in two of the cells make the power
of this test rather low. Tests of A eliminating B and B eliminating A are
clearly non-significant, while the tests of A ignoring B and B ignoring A,
given previously, are both significant. All five of these statistical tests must
be considered in order to draw proper conclusions about the population means.
The tests of A eliminating B and B eliminating A do not provide us with any
evidence regarding the existence of either A or B effects (although they clearly
imply that both effects are not necessary jointly), while the tests of A
ignoring B and B ignoring A separately provide us with evidence for either effect,
depending upon which test we consider. Keeping in mind the models which are
compared, the "eliminating" tests tell us that we have no evidence for one
effect in addition to the other. We must conclude then from this statistical
analysis that there must be some effect, either an A or B effect, but we
cannot tell which, and there is, clearly, no evidence to suppose that both
exist. This is in line with the previous conclusion obtained by informal
arguments earlier. It should be noted that because of the substantially
disproportionate numbers of observations in cells, the power of the eliminating
tests is rather low and the effects are highly confounded. Indeed, this
example approaches closely the completely confounded case in which all observations
would be in the $A_1B_1$ and $A_2B_2$ cells. In the completely confounded case, the
one degree of freedom between cells could be attributed to either an A effect
or a B effect with no possibility of deciding between them.
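The sums of squares in Table 5 can be approximately reproduced with a short sketch. It is not the MANOVA program: it uses a closed-form shortcut that holds only for a 2 x 2 layout. The cell means follow Table 4, the within-cells mean square is the value printed in Table 5, and the cell counts are an assumption made here for illustration.

```python
# Cell means follow Table 4 (10 in both cells of A1, 20 in both cells of A2).
means = [[10.0, 10.0],
         [20.0, 20.0]]
# ASSUMPTION: illustrative cell counts, not quoted from the report.
counts = [[25, 2],
          [2, 25]]
MS_WITHIN = 225.00  # within-cells mean square as printed in Table 5


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def group_ss(group_means, group_ns, grand):
    """Sum of squares between groups around the grand mean."""
    return sum(n * (m - grand) ** 2 for m, n in zip(group_means, group_ns))


flat_means = [m for row in means for m in row]
flat_counts = [n for row in counts for n in row]
grand = weighted_mean(flat_means, flat_counts)

# One-way sums of squares: between cells, between rows (A), between columns (B).
ss_cells = group_ss(flat_means, flat_counts, grand)
row_ns = [sum(row) for row in counts]
row_means = [weighted_mean(means[i], counts[i]) for i in range(2)]
col_ns = [counts[0][j] + counts[1][j] for j in range(2)]
col_means = [weighted_mean([means[0][j], means[1][j]],
                           [counts[0][j], counts[1][j]]) for j in range(2)]
ss_a_ignoring_b = group_ss(row_means, row_ns, grand)
ss_b_ignoring_a = group_ss(col_means, col_ns, grand)

# In a 2 x 2 layout the interaction has one degree of freedom, so its sum of
# squares is the squared interaction contrast divided by the sum of 1/n.
contrast = means[0][0] - means[0][1] - means[1][0] + means[1][1]
ss_ab = contrast ** 2 / sum(1.0 / n for n in flat_counts)

# Sum of squares explained by the additive model (A and B together).
ss_additive = ss_cells - ss_ab

ss = {
    "AB": ss_ab,
    "A eliminating B": ss_additive - ss_b_ignoring_a,
    "B eliminating A": ss_additive - ss_a_ignoring_b,
    "A ignoring B": ss_a_ignoring_b,
    "B ignoring A": ss_b_ignoring_a,
}
for source, value in ss.items():
    print(f"{source:16s} SS = {value:8.2f}  F = {value / MS_WITHIN:6.3f}")
```

Each "eliminating" sum of squares is obtained by subtracting the other factor's "ignoring" sum of squares from the sum of squares of the additive model, which mirrors the comparison of model II with models III and IV.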

Interpretation of Results 

The patterns of possible results from the analysis of a two factor
design with no interaction are given in Table 6. Pattern 1 indicates that

Table 6

Pattern of Results: Two-way Factorial without Interaction

                                  Pattern
Test                 1     2     3     4     5     6     7
A eliminating B      s     s     ns    ns    ns    ns    ns
B eliminating A      s     ns    s     ns    ns    ns    ns
A ignoring B         x     x     x     ns    s     s     ns
B ignoring A         x     x     x     ns    s     ns    s

s = significant; ns = not significant; x = irrelevant





A and B are both needed in the model since, given the presence of one, the
other is still significant. Patterns 2 and 3 both illustrate cases for which
a second main effect is not needed given the inclusion of the other, but the
significant effect must be included (i.e., from Pattern 2 we would retain the
A effect, from Pattern 3, the B effect). Pattern 4 is the case for which no
main effects are included in the final model. These constitute the standard,
easy to interpret cases and are the only cases which may arise from an
orthogonal design. The remaining patterns are unique to the non-orthogonal case.
Pattern 5 is the seriously confounded situation presented earlier in which only
one effect need be included in the final model, but due to confounding the
choice of which effect to retain is indeterminate. Patterns 6 and 7 occur
only in situations in which there is very serious confounding in the design.
The significant main effect should be included in the final model. In these
circumstances it is particularly important to ask why such a seriously
confounded design was produced and to carefully attend to the implication this
has to the phenomena being investigated.
Recommended procedure for a two-way non-orthogonal design

On the basis of the material developed to this point we suggest the following
procedure be employed in the analysis of a non-orthogonal two factor design. It
should be emphasized that this procedure is for the logical flow of decisions
and conclusions which are made in such an analysis, but does not dictate the
actual order in which the computations need be performed. Indeed, in most of
the standard computer programs available for such an analysis (e.g., MANOVA;
Cramer, 1967) the required tests would be produced in a rather different order.
However, once the results of all required tests are available, we would suggest
proceeding as follows:

A. Begin with the full model including main and interaction effects.

B. Test for a significant interaction (AB eliminating both A and B). If this
   test is significant no tests of main effects are appropriate; however, one
   might wish to test certain contrasts in the cell means to aid in
   interpretation of the results. If the test is non-significant eliminate the
   $\gamma_{ij}$ term from the model and proceed to step C for tests of main effects.

C. Test A eliminating B and B eliminating A.

   1. If both tests are significant adopt the model
      $Y_{ij} = \mu + \alpha_i + \beta_j + e$.

   2. If only one of the two tests is significant adopt the model
      $Y_{ij} = \mu + \alpha_i + e$ (if A eliminating B is the significant one) or
      $Y_{ij} = \mu + \beta_j + e$ (if B eliminating A is the significant one).

   3. If neither is significant proceed to D.

D. Test A ignoring B and B ignoring A.

   1. If both are significant retain either $\alpha_i$ or $\beta_j$, but not both, in the
      final model; the choice is indeterminate. In this case additional
      experimental evidence will usually have to be obtained before much could
      be said about the meaning of the experiment.

   2. If only one of the two tests is significant, the significant effect
      should be retained, but the cautions referred to in the discussion
      of patterns 6 and 7 should be diligently adhered to.

   3. If neither test is significant no main effects should be included in
      the final model, i.e., adopt the model $Y_{ij} = \mu + e$.
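Steps A through D can be sketched as a small decision function. The function name, the returned strings, and the .05 significance level are assumptions introduced here for illustration; the p values are those printed in Table 5.

```python
def choose_two_way_model(ab, a_elim_b, b_elim_a, a_ign_b, b_ign_a):
    """Return the model suggested by steps A-D from True/False significance flags."""
    if ab:                                   # step B
        return "interaction model (no tests of main effects)"
    if a_elim_b and b_elim_a:                # step C.1
        return "mu + alpha + beta"
    if a_elim_b:                             # step C.2
        return "mu + alpha"
    if b_elim_a:
        return "mu + beta"
    if a_ign_b and b_ign_a:                  # step D.1
        return "mu + alpha OR mu + beta (indeterminate)"
    if a_ign_b:                              # step D.2
        return "mu + alpha (seriously confounded design)"
    if b_ign_a:
        return "mu + beta (seriously confounded design)"
    return "mu"                              # step D.3


ALPHA = 0.05  # ASSUMPTION: conventional level; the text does not state one
p_values = {  # p values as printed in Table 5
    "AB": 1.000,
    "A eliminating B": 0.205,
    "B eliminating A": 1.000,
    "A ignoring B": 0.017,
    "B ignoring A": 0.042,
}
flags = {name: p < ALPHA for name, p in p_values.items()}
decision = choose_two_way_model(flags["AB"],
                                flags["A eliminating B"],
                                flags["B eliminating A"],
                                flags["A ignoring B"],
                                flags["B ignoring A"])
print(decision)
```

The order of the `if` statements carries the logic: the "ignoring" tests are consulted only after both "eliminating" tests have failed to reach significance.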

Extension to Higher Order Designs
As a non-orthogonal design becomes more complex through the inclusion
of additional factors the proper analysis becomes far more tedious although
the basic logical structure remains the same. In all cases we are attempting
to find the simplest model which adequately fits the data by comparing
competing models. As the number of factors increases the total number of
potential tests (model comparisons) increases very rapidly. For a q-factor
design the total number of potential tests is given by

where $_qC_i$ is the number of combinations of q things taken i at a time. In
most cases, however, not all tests will be performed.

Because of certain symmetries which exist in the three factor case, the
extension of the two factor procedure to higher order designs is most easily seen
through the analysis of the three factor design. In general the process
begins by determining if the triple order interaction is necessary. If it
is not, one proceeds to determine how many and which second order
interactions, if any, are necessary and finally, in the absence of second order
interaction, how many and which main effects are necessary in the model.

As a general point it should be noted that when a second order
interaction is included in a model (say the $\beta\gamma$ term), the main effects implied
by that term (in this case $\beta$ and $\gamma$) will be also included; the other main
effect terms (in this case $\alpha$) may or may not be needed in the model. To
determine if other main effects should be included requires a separate set
of tests.

Procedure for three factor non-orthogonal design

The process begins by tentatively adopting the full model,
$Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_k + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk} + (\alpha\beta\gamma)_{ijk} + e$,
and eliminating unnecessary terms. First
one would test the triple order interaction, ABC, eliminating all second order
interactions and main effects, i.e., asking the question: given the lower order
effects, do we need the triple order interaction? If the test of the ABC
interaction is significant, one would accept the full model and proceed, if desired,
to test specific contrasts in cell means to aid with interpretation. If on
the other hand the triple order interaction is non-significant, indicating
that the effect is not required in the model, given the possibility of lower
order effects, one would drop the $(\alpha\beta\gamma)_{ijk}$ term from the model and would proceed
to investigate the second order interaction terms in order to determine how
many and which terms to include in the model.

At this point in our discussion, however, we shall consider the pro- 
cedure for main effects rather than second order interactions. We do this 
because some of the concepts carry over directly from the two factor design 
and, given certain symmetries in the three factor design, it is possible to 
then directly apply these concepts to tests of interaction terms. We must 
emphasize that in the actual use of the process, tests of second order
interaction would always precede tests of main effects.
On notation 

In order to simplify the naming of various tests (model comparisons) in 
the discussion to follow the following notational scheme will be used 

(1) the symbol | will be used to indicate eliminating 

(2) the absence of a term to the right of the | symbol of the same order 
as the term on the left of the | implies that term is ignored 

(3) it is assumed that all lower order terms are eliminated from higher 
order terms. 

Thus, for a three factor design with factors A, B, and C,
A|B,C implies the test of A eliminating B and C,
A|B implies the test of A eliminating B and ignoring C,
A implies the test of A ignoring B and C,
AB|AC,BC implies the test of AB eliminating AC, BC, A, B, and C,
while AB implies the test of AB ignoring AC and BC but eliminating A, B, and C.
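The three notational rules can be made concrete with a small helper that expands a shorthand test name into explicit lists of eliminated and ignored terms. This is a sketch only; the function name `describe_test` and the dictionary layout are hypothetical, and factors are assumed to be single capital letters.

```python
from itertools import combinations


def describe_test(notation, factors="ABC"):
    """Expand shorthand such as 'A|B' or 'AB|AC,BC' into explicit term lists."""
    left, _, right = notation.partition("|")
    tested = left.strip()
    named = [term.strip() for term in right.split(",") if term.strip()]
    order = len(tested)
    # All terms of the same order as the tested term (rule 2 applies to these).
    same_order = ["".join(c) for c in combinations(factors, order)]
    # All terms of lower order are always eliminated (rule 3).
    lower_order = ["".join(c) for k in range(1, order)
                   for c in combinations(factors, k)]
    ignored = [t for t in same_order if t != tested and t not in named]
    return {"tested": tested,
            "eliminated": named + lower_order,
            "ignored": ignored}


for name in ("A|B,C", "A|B", "A", "AB|AC,BC", "AB"):
    print(name, describe_test(name))
```

Applied to the five examples in the text, the helper returns the same eliminated and ignored terms that the text lists.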

Tests of Main Effects
In testing for main effects we are trying to determine how many effects
must be included in the model and which ones they are. The only circumstance
under which it would be necessary to include all of the main effects is when
each main effect is significant eliminating the other two, i.e., when the tests
A|B,C; B|A,C; and C|A,B are all significant. If only two of the three tests
of main effects eliminating both of the others are significant, the two
significant effects would be retained while the third would be deleted from the
model. Thus if all three of the tests or if two of the three tests are
significant our conclusions are quite direct: retain the significant effects.

When, however, only one or none of the three tests is significant the
situation is somewhat more complex. If only one of the main effect terms
eliminating the other two is significant, say A|B,C, the significant term
should clearly be retained; however, it may be desirable to retain one of
the other two effects. Since we have already decided to keep the A effect
we need ask whether we need either the B or the C effect given the A effect, i.e.,
to test B|A and C|A. If neither of these tests is significant then clearly
neither effect need be present given the A effect in the model. If one of
the two is significant, say C|A, that term, C, should be included in the
final model along with the A term. Should, however, both be significant,
we are in an ambiguous situation. Previous tests have indicated that all
three effects are not needed in the model and that the A effect must be in
the model; therefore our choice between B and C is completely indeterminate.

The potentially most complicated situation obtains when none of the three
"doubly eliminating" tests is significant. It is still possible that one
or two effects should be included in the model. In the two factor
design, we reasoned that the significance of both A|B and B|A indicated that
both A and B should be included. In the three factor design there are three
such pairs of tests involving A and B, A and C, and B and C (i.e., A|B and
B|A; A|C and C|A; and B|C and C|B). The joint significance of any one of
these pairs of tests indicates the need to include the relevant pair of
effects, but only two such effects may be included, our previous tests having
excluded the possibility of all three effects being included in the model.
If more than one pair of these tests shows significance we are uncertain as to
which pair of effects to include. This is analogous to the two factor case
where we were uncertain as to which of the two main effects to include. Should
no pair of effects be significant we are then left with the possibility
of including only a single effect in the model. Thus if any one effect were
to appear significant (e.g., if the tests of A|B or A|C were significant)
we would include it in the model. Should none of the "single eliminating"
tests be significant we would then examine the "doubly ignoring" tests A, B,
and C, as these may still indicate the necessity of including a single main
effect. If none of these tests is significant we would conclude that no main
effects were necessary and would be left with the model $Y_{ijk} = \mu + e_{ijk}$, but if
one of these tests is significant, that effect would be included in the final
model. If two or more of the "doubly ignoring" tests are significant we are
again in an indeterminate situation and may choose one of the
significant effects for the final model, but the choice is completely arbitrary.
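The cascade of main-effect tests for the three factor design can be sketched as a function that returns the candidate sets of effects to retain. The function names and return format are hypothetical, and one branch (several different effects passing a "single eliminating" test) is not covered by the text and is handled here by an assumption marked in the code.

```python
FACTORS = ["A", "B", "C"]


def all_tests():
    """Names of the twelve main-effect tests, e.g. 'A|B,C', 'A|B', 'A'."""
    names = []
    for f in FACTORS:
        rest = [g for g in FACTORS if g != f]
        names.append(f + "|" + ",".join(rest))    # doubly eliminating
        names.extend(f + "|" + g for g in rest)   # single eliminating
        names.append(f)                           # doubly ignoring
    return names


def choose_main_effects(sig):
    """Return candidate sets of main effects; a single candidate means no ambiguity.

    `sig` maps every name from all_tests() to True (significant) or False.
    """
    def rest(f):
        return [g for g in FACTORS if g != f]

    # Each effect eliminating the other two.
    double = [f for f in FACTORS if sig[f + "|" + ",".join(rest(f))]]
    if len(double) >= 2:
        return [double]
    if len(double) == 1:
        kept = double[0]
        extra = [g for g in rest(kept) if sig[g + "|" + kept]]
        if len(extra) <= 1:
            return [sorted([kept] + extra)]
        # Both remaining effects are significant given `kept`: indeterminate.
        return [sorted([kept, g]) for g in extra]

    # No doubly eliminating test is significant: look for jointly significant pairs.
    pairs = [[f, g] for i, f in enumerate(FACTORS) for g in FACTORS[i + 1:]
             if sig[f + "|" + g] and sig[g + "|" + f]]
    if pairs:
        return pairs

    # ASSUMPTION: if several different effects pass a single eliminating test,
    # they are reported as alternatives; the text only discusses one such effect.
    singles = [f for f in FACTORS if any(sig[f + "|" + g] for g in rest(f))]
    if singles:
        return [[f] for f in singles]

    # Finally the doubly ignoring tests A, B, and C.
    ignoring = [f for f in FACTORS if sig[f]]
    if ignoring:
        return [[f] for f in ignoring]
    return [[]]  # no main effects retained


sig = dict.fromkeys(all_tests(), False)
sig.update({"A|B,C": True, "B|A": True, "C|A": True})
print(choose_main_effects(sig))
```

In the usage example only A is significant eliminating the other two, and both B|A and C|A are significant, so the function reports two alternative models, reflecting the indeterminate choice between B and C.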

Application to two-way interactions
The application of the "main effect procedure" to two-way interactions is
straightforward if we but note the following symmetry which exists in the three
factor case. Since there are three two-way interactions and three main effects
in a three factor model, the patterns of tests for main effects and for
two-way interactions are exactly the same. Corresponding to tests of main effects
A, B, and C there are tests of interactions AB, AC, and BC. For every main
effect test, say A|B,C, there is a corresponding test AB|AC,BC. Hence, the
above procedure is first applied to the three two-way interactions eliminating
all main effects and other two-way interactions, i.e., AB|AC,BC; AC|AB,BC; and
BC|AB,AC, and would then be followed with parallel tests as needed. Should the
conclusion be that there are no interactions, the procedure would then be
applied to the main effects. If there are significant interactions, the factors
involved should be also included as main effects, as noted earlier. Should
only one two-way interaction be included, the question of retaining the uninvolved
main effect should be considered. To do this the test of that effect eliminating
the other two main effects and the significant interaction should be performed,
e.g., if it were the BC interaction that were significant one should perform the
test A|B,C,BC in order to determine if the A effect should be included in
addition to the B, C, and BC effects.

Some additional comments 

The methods discussed for both the two and three factor cases have
proceeded on the assumption that there is no a priori preference for explaining
the data in terms of one factor above any others. Such a preference may exist
in designs such as randomized blocks where we would customarily not even
consider the test of treatments ignoring blocks; we assume that there are block
effects and are willing to consider the presence of treatment effects only
if the test of treatment eliminating blocks is significant. Similar considerations
may apply in a wide variety of cases and may simplify the process discussed here.

Another consideration is the number of tests involved in the complete
procedure. Some of these tests will be highly correlated and some will be
independent depending upon the degree and pattern of non-orthogonality. The
extreme is illustrated by the two factor orthogonal case in which the tests A
and B are independent while A|B and A are identical. In the case of lack of
knowledge of likely effects one may perform preliminary combined tests such as
a test of pooled interaction prior to doing individual tests. This would have
to be moderated, however, by any knowledge which would, a priori, suggest the
existence of specific effects.

Overall, Spiegel, and Cohen (1975) have considered some of the problems
discussed above and have arrived at very different and demonstrably much more
limited conclusions. Since this is so relevant to balancing,
we will indicate the serious flaws in their "proper" generalization of orthogonal
ANOVA.

An Analysis of the Recommendations Presented by Overall, Spiegel, and Cohen

Overall and Spiegel (1969) considered three methods of analysis in
nonorthogonal ANOVA without favoring any one as being the appropriate one. Overall,
Spiegel, and Cohen (1975) then argued that one of the three methods is indeed
the only proper one to use. In describing how they arrived at this conclusion,
they note that the strategy that "appeared most often to be recommended in
applied statistics texts involves basically a 'main effects' model with tests
for interaction effects included as a safeguard against departures from additivity
(Rao, 1965; Snedecor & Cochran, 1967; Winer, 1971). The analysis proposed by
Appelbaum and Cramer (1974) follows this logic" (p. 184). The argument against
this approach, as developed by Overall et al., rests upon a single principle which
we believe to be correct and proper, and a single procedure which is easily
demonstrated to be erroneous.

The principle is "that the method for the analysis of variance of data from
nonorthogonal designs should estimate the same parameters and test the same
hypotheses as can otherwise be estimated and tested in a balanced analysis of variance and
experimental design involving the same factors" (p. 184). This
is consistent with our views since in our 1974 paper we said,
"Having decided to employ the method of least squares ... one is
left only with the selection of possible models and model
comparisons. The models selected are logically independent of the
observed numbers of observations per cell" (p. 336). The key
point which Overall et al. ignore is the choice of model. Given
a model, we would argue that our methods test the same hypotheses
and estimate the same parameters whether there are equal numbers
of observations or not. In the absence of a model we believe it
to be meaningless to talk of estimating parameters, much less
testing them.

The procedure Overall et al. propose for verifying that a
particular method satisfies the above criterion is "to generate
data for orthogonal and nonorthogonal designs involving exactly
the same $\alpha_i$, $\beta_j$, and $(\alpha\beta)_{ij}$ and then to determine which method of
analysis yields the same parameter estimates in the orthogonal
and nonorthogonal cases." This procedure is ill-defined since it
does not state how such data should be generated. If the example
presented by Overall et al. is meant to make the procedure
precise, it is clear that their procedure is patently inappropriate.
Overall et al. present data arranged in a 3x3 ANOVA with three
observations per cell and then duplicate the observations in
certain cells to make the design nonorthogonal. They state that
"the reader will appreciate that duplication of certain scores
does not invalidate the analysis of variance" (p. 184). Quite
the contrary, it does invalidate the analysis of variance since
the observations are clearly not independent in the various cells.
Furthermore if they claim (as they appear to) that the addition
of observations should not change the estimates of the parameters,
it must be the case that the method ignores, in generating
estimates, any information in the additional observations. How can
a method that ignores such information be a good method?
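The independence objection can be illustrated with a minimal sketch: entering every score twice adds no information, yet a formula that assumes independent observations reports a smaller standard error. The scores below are made up for illustration and are not the data of Overall et al.; the helper name is hypothetical.

```python
from statistics import mean, stdev

# ASSUMPTION: made-up scores for a single cell; not data from Overall et al.
scores = [8.0, 10.0, 12.0]
duplicated = scores + scores  # every score entered twice


def standard_error(values):
    """Estimated standard error of the mean, treating values as independent."""
    return stdev(values) / len(values) ** 0.5


print(mean(scores), round(standard_error(scores), 3))
print(mean(duplicated), round(standard_error(duplicated), 3))
```

The cell mean is unchanged by the duplication while the apparent precision improves, which is the sense in which the duplicated observations are not independent.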

We have analyzed the data given by Overall et al. and we
suggest that even if one ignores the question of independence and
follows the procedures we have advocated he will not perform any
tests of "main effects" for the simple reason that there is a
significant interaction in the data which they present. (A
detailed discussion of the problems involved in testing and
estimating "main effects" in the presence of an interaction follows.)
Analyzing their data with the additional observations, we obtain
an F value of 6.4 for the interaction which is significant beyond
the .001 level. Given this result we would probably wish to look
at A effects for given levels of B, or B effects for given levels
of A, or possibly individual interaction contrasts. We doubt
that we would have any interest whatsoever in any of the
parameters that Overall et al. obtain or in any of the main effect
tests they perform. Indeed, we have made in our earlier work
specific recommendations as follows:

1. Begin with the full model including main effects and
interaction effects.

2. Test for a significant interaction; if this test is
significant no tests of main effects are appropriate;
however, one may wish to test certain contrasts in the
cell means to aid in interpretation of the results.

Procedures for the Case of Significant Interaction

Since, in our previous work, we were not specific about
what we would do in the case of a significant interaction, it
may be useful to consider our interpretation for this example.
The cell means and numbers of observations are as shown in Table
7. Our standard ANOVA for an interactive model gives us an
estimated standard deviation of 1.93 based on the within cells sum
of squares. The marginal means shown are the unweighted means
of the cell means for rows and columns. The significant
interaction (F = 6.4, p<.001) tells us that there are effects of both
A and B, but that the A effects are different for different
levels of B just as the B effects are different for different
levels of A. It seems clear from examination that the
interaction is due primarily to the value 11.7 in cell 13. If we
delete that cell we can obtain the test of that portion of
interaction remaining with three degrees of freedom rather than
four. On reanalysis, with the 13 cell deleted,² we find that
the interaction is no longer significant (p=.27), strengthening
the belief that this one cell is responsible for the significant
interaction. The test of A eliminating B is highly significant
(p<.001) while the test of B eliminating A is marginal (p=.10).
It appears then that if cell 13 is dropped there is definitive
evidence only for an A effect.



² Analysis of variance programs such as MANOVA (Cramer, 1967)
allow for the complete deletion of specified cells, making such
an analysis a simple matter.

As an alternative or supplementary analysis, we have analyzed
the simple effects of A for each level of B and the simple effects
of B for each level of A. These analyses also confirm what
inspection of Table 7 suggests; the simple A effects for levels one,
two, and three of B are highly significant (p=.009, .001, .001).
The simple B effects for levels one and three of A are significant
(p=.001, .03); the simple effect of B for level two of A is not
significant (p=.81).

On Main Effects, Marginal Effects, and Interaction

Additional insight into the nature of this problem can be
gained through a more careful consideration of the problems of
testing and estimating "main effects" in the presence of
interaction. At this point it is necessary to introduce a basic logical
distinction between two concepts which have, unfortunately, come
to be held as virtually synonymous: a main effect and a marginal
effect. By a main effect we mean the effect of a particular
experimental treatment or state of nature which is the common
and consistent effect of that treatment or state of nature
irrespective of what other treatments or states of nature it is
combined with. By a marginal effect we mean simply the average
effect of the experimental treatment (state of nature) averaged,
in some sense, over all occurrences of that treatment. These
two concepts are equivalent only in the noninteractive model.
In the case of a model in which there is an interaction, the two
concepts are quite distinct; in fact, under the interactive model,
the concept of a main effect does not apply, for an interaction
implies that there is no consistent effect of the treatment,

Table 7

Means and Numbers of Observations for Data from
Overall, Spiegel, and Cohen

Table 8

Illustration of Marginal Means for Interactive Model

Table 9

Illustration of Marginal Means for Non-interactive Model

but rather that one must consider a treatment in combination
with some other treatment(s) in order to assess its effect.
This distinction can also be seen through the concept of a simple
row (or column) effect which is commonly defined as the difference
between a cell mean and its corresponding row (or column) mean.
If, for a given factor, the simple effect of the several
treatments should be identical for all levels of other factors, this
constant simple effect is the main effect.

Given then that one is operating with a model which contains 
an interaction term it is, at best, misleading to speak of main 
effects, for one is considering marginal effects. These marginal 
effects will be averages of cell means across rows or columns of 
the design. There is no particular reason for using a simple 
average rather than a weighted average. If the model is truly 
interactive the weights used will have a substantial effect on 
the marginal effects. Suppose, for example, the cell means are 
as shown in Table 8 for a two by two ANOVA. If we define a 
marginal A mean as 

\[ \bar{\mu}_{i\cdot} = w\,\mu_{i1} + (1 - w)\,\mu_{i2} \]

we find that the difference in marginal means for A will be 0, 
-10, or 10, depending on whether w is .5, 1, or 0. For the data 
from a noninteractive model shown in Table 9, we find that the 
difference in marginal means is 20 regardless of what the weights are. 

The tests of main effects proposed by Overall et al. in 
Method I are in fact tests of equally weighted marginal means 
for an interactive model. It can also be shown that these tests 
are equivalent to the method of unweighted squares of means proposed 
by Yates (1934) and discussed by Bancroft (1968). These 
are tests of the equivalence of row or column marginal means 



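\[ \frac{1}{b}\sum_j \mu_{1j} = \frac{1}{b}\sum_j \mu_{2j} = \cdots = \frac{1}{b}\sum_j \mu_{aj} \]

and 

\[ \frac{1}{a}\sum_i \mu_{i1} = \frac{1}{a}\sum_i \mu_{i2} = \cdots = \frac{1}{a}\sum_i \mu_{ib} \]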



These particular marginal means are but one of many possible sets 
of marginal means which can be constructed, and it is by no means 
clear that this is the most desirable set to test in any particular 
situation (see Appelbaum & Cramer, 1975). 
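
The dependence of marginal differences on the weights can be checked numerically. The sketch below is only an illustration: the cell means are hypothetical values chosen to reproduce the differences quoted in the text (0, -10, 10 for the interactive case and a constant 20 for the noninteractive case); they are not the entries of Tables 8 and 9, and the function names are assumptions.

```python
# Hypothetical 2 x 2 cell means (assumed for illustration; NOT the report's
# Table 8 / Table 9 values). Rows are levels of A, columns are levels of B.
interactive = [[10.0, 30.0],
               [20.0, 20.0]]
additive = [[30.0, 40.0],
            [10.0, 20.0]]


def marginal_a_means(cells, w):
    """Weighted marginal A means: w * mu_i1 + (1 - w) * mu_i2 for each row i."""
    return [w * row[0] + (1.0 - w) * row[1] for row in cells]


def marginal_a_difference(cells, w):
    """Difference between the marginal means of A1 and A2 for weight w."""
    first, second = marginal_a_means(cells, w)
    return first - second


for w in (0.5, 1.0, 0.0):
    print(w,
          marginal_a_difference(interactive, w),
          marginal_a_difference(additive, w))
```

With interaction present the sign of the marginal difference changes with w; in the additive layout every choice of w gives the same difference.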

We believe then that the above analyses reveal essentially 
everything there is in the data. As we have indicated, the tests 
of main effects recommended by Overall et al. are equivalent to 
the tests of equality of marginal means as we have defined them. 
We do not find these tests very interesting since the marginal 
means represent only average effects for rows and columns, while 
the significant interaction tells us that these average effects 
are different from the actual effects for each row and column. 
The marginal A effect is significant (p=.001); the marginal B 
effect is not (p=.23). 

We would argue then that the example presented by Overall 
et al. does not bear on the validity of the methods we have advocated, 
for the simple reason that there is an interaction present. 
Furthermore, it seems to us that their analysis of "main effects" 
is not directed to the questions that psychologists will typically 
wish to address. We could of course modify their example 
so that the interaction is nonsignificant. Then, as we have 
noted, it would violate the assumptions of independence. Their 
procedure is simply not valid in principle. 

Estimation and the Overall et al. Criterion 

Overall et al. have erred in assuming that if two methods 
estimate the same parameters, they must yield the same estimates. 
This is obviously false. To estimate a population mean we could 
use a sample mean or simply use the first observation, discarding 
the others. Both the sample mean and the first observation are 
unbiased estimators of the population mean, but they will, in 
general, yield different estimates. The sample mean is better 
since it will be closer to the population mean on the average. 
This precision of estimates is the crucial distinction between the 
methods Overall et al. advocate and the methods we advocate. 
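
The point about two unbiased estimators of differing precision can be illustrated with a small simulation. This is a sketch under assumed settings (normal data with mean 50 and standard deviation 10, samples of size 10, a fixed seed); none of these values come from the report.

```python
import random
import statistics


def compare_estimators(n=10, replications=5000, mu=50.0, sigma=10.0, seed=1):
    """Simulate two unbiased estimators of a population mean.

    Estimator 1 keeps only the first observation of each sample;
    estimator 2 is the ordinary sample mean. All settings are assumptions.
    """
    rng = random.Random(seed)
    first_obs, sample_means = [], []
    for _ in range(replications):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        first_obs.append(sample[0])
        sample_means.append(statistics.fmean(sample))
    return {
        "mean_of_first_obs": statistics.fmean(first_obs),
        "mean_of_sample_means": statistics.fmean(sample_means),
        "var_of_first_obs": statistics.pvariance(first_obs),
        "var_of_sample_means": statistics.pvariance(sample_means),
    }


results = compare_estimators()
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Both averages should land near the assumed population mean, while the variance of the first-observation estimator should be roughly n times that of the sample mean.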

In our 1974 paper the topic of estimation in the nonorthogonal 
ANOVA did not appear since we did not believe that there was 
any disagreement as to what was appropriate. We now feel that 
this topic does require some attention. 

The estimation problem is easily and completely solved once 
one decides upon the model which one believes applies to the real 
world. The usual role of significance testing is to determine, 
based upon the data of the real world, which model is the most 
reasonable one from among a set of competing models. Having 
made a decision as to which model obtains, one may then proceed 
to estimate the parameters of the model, but estimation may occur 
only in the context of a particular model. 

Let us now consider one possible model, the two-factor 
interactive model 

\[ Y_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e \qquad (1) \]

It is a trivial matter to obtain a set of least squares estimates 
of the parameters of this model. We say "a set" because there 
are infinitely many sets which are equivalent in the sense that 
they will yield identical predicted values $\hat{Y}$. It is, however, 
a standard practice to impose additional constraints upon the 
model in order to obtain a unique set of estimates. The purpose 
of the constraining system, however, is solely computational 
convenience. It is obvious that the very best we can do in this 
model is to predict the cell means exactly, since there are no 
parameters which are unique to any single observation. Any two 
models which predict the cell means exactly must be equivalent. 
It is also a consequence of the mathematics of the system that 
any model which has as many independent parameters as cells must 
predict the cell means exactly. 

There exist infinitely many constraining systems which may 
be applied to the full interactive model in order to produce the 
computational determinacy desired. The simplest of these is 

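\[ \mu = \alpha_i = \beta_j = 0 \quad \text{(for all } i \text{ and } j\text{)} \]

leaving the model 

\[ Y_{ij} = \gamma_{ij} + e \qquad (2) \]

In this case the least squares estimates of the $\gamma_{ij}$ are simply 
the observed cell means, $\bar{Y}_{ij}$. 

The more usual (conventional) constraining system, however, is 

\[ \sum_i \alpha_i = \sum_j \beta_j = \sum_i \gamma_{ij} = \sum_j \gamma_{ij} = 0 \qquad (3) \]
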
If the design has a levels of factor A and b levels of factor 
B, there are then (after the imposition of the constraints) 
1 + (a-1) + (b-1) + (a-1)(b-1) = ab independent parameters, which 
is also the number of independent parameters in (2). This equals 
the number of cells in the design, and it then follows that the 
model constrained in this way must be equivalent to that in (2). 
For those familiar with the matrix approach to the analysis of 
variance this result is easily seen from the fact that the model 
matrix for this constrained design must have ab columns. 

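It is also a rather trivial matter to write directly the 
least squares estimates of the parameters of the interactive 
model constrained by (3). They are 

\[
\begin{aligned}
\hat{\mu} &= \bar{Y}_{\cdot\cdot} \\
\hat{\alpha}_i &= \bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot} \\
\hat{\beta}_j &= \bar{Y}_{\cdot j} - \bar{Y}_{\cdot\cdot} \\
\hat{\gamma}_{ij} &= \bar{Y}_{ij} - \bar{Y}_{i\cdot} - \bar{Y}_{\cdot j} + \bar{Y}_{\cdot\cdot}
\end{aligned}
\qquad (4)
\]

where $\bar{Y}_{\cdot\cdot}$ is the unweighted average of the cell means, while $\bar{Y}_{i\cdot}$ 
and $\bar{Y}_{\cdot j}$ are the unweighted averages of the cell means for row i 
and column j, respectively. Substituting these estimates into 
(1) gives 

\[ \hat{Y}_{ij} = \hat{\mu} + \hat{\alpha}_i + \hat{\beta}_j + \hat{\gamma}_{ij} = \bar{Y}_{ij} , \]

again showing the equivalence of (1) and (2). 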
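
As a numerical check of this equivalence, the following sketch computes estimates of the interactive model under the sum-to-zero constraints (3) from a table of cell means and confirms that they add back up to the cell means. The cell means are hypothetical, the function name is an assumption, and the interaction estimate is taken to be the usual residual (cell mean minus row average minus column average plus grand average).

```python
def constrained_estimates(cell_means):
    """Estimates for the interactive model under sum-to-zero constraints.

    cell_means is an a x b table of observed cell means; all averages are
    unweighted, so cell frequencies play no role in the estimates.
    """
    a, b = len(cell_means), len(cell_means[0])
    grand = sum(sum(row) for row in cell_means) / (a * b)
    row_avg = [sum(row) / b for row in cell_means]
    col_avg = [sum(cell_means[i][j] for i in range(a)) / a for j in range(b)]
    mu = grand
    alpha = [r - grand for r in row_avg]
    beta = [c - grand for c in col_avg]
    gamma = [[cell_means[i][j] - row_avg[i] - col_avg[j] + grand
              for j in range(b)] for i in range(a)]
    return mu, alpha, beta, gamma


# Hypothetical cell means for a 2 x 3 design (assumed for illustration).
cell_means = [[10.0, 14.0, 18.0],
              [12.0, 20.0, 16.0]]
mu, alpha, beta, gamma = constrained_estimates(cell_means)

# Rebuild each predicted cell mean from the four kinds of estimates.
predicted = [[mu + alpha[i] + beta[j] + gamma[i][j] for j in range(3)]
             for i in range(2)]
print(predicted)
```

The rebuilt table matches the observed cell means up to rounding, which is the sense in which the constrained model and the cell means model are the same model.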

We thus see that estimation in the interactive model is 
rather trivial, with the estimates of the parameters being simple 
linear functions of the observed cell means and free of the n 

In point of fact there is really no gain in talking of 
estimating parameters in (1) since it is equivalent to (2), which 
is a cell means model requiring no constraints. We have a x b 
populations (one per cell) and the only parameters of interest 
are their means and their common variance. 

The situation, in general, is not nearly so simple when 
there is no interaction, that is, when estimation proceeds within 
the model 

\[ Y_{ij} = \mu + \alpha_i + \beta_j + e \qquad (5) \]

In general we would have to solve a set of simultaneous least 
squares equations in order to obtain estimates of the parameters in 
(5). An exception occurs, however, in the orthogonal case, in 
which the estimates of $\mu$, $\alpha_i$, and $\beta_j$ have the form given in (4). 
In the more general nonorthogonal case, there will again be 
infinitely many solutions to the unconstrained least squares 
equations, although estimates of $\mu + \alpha_i$ and $\mu + \beta_j$ will be 
unique. 

An interesting result of least squares estimation in (5) is 
that the estimates obtained for $\mu$, $\alpha$, and $\beta$ from model (1) yield 
unbiased estimates of $Y_{ij}$ in (5), but the estimates are less 
precise, that is, they have larger variances than the estimates 
obtained from (5). For this reason it will be desirable to 
estimate the parameters of (5) when we have accepted (5) as the 
true model rather than use the estimates from model (1). 

We would agree that the procedures advocated by Overall 
et al. test the same hypotheses in both the orthogonal and 
nonorthogonal case; further, we agree that they are valid tests 
of certain hypotheses, but we doubt that they are hypotheses of 
particular interest in either the orthogonal or nonorthogonal case. 
We believe that an informed statistical analyst would not perform 
a test of main effects in the presence of a significant interaction 
in the orthogonal case; why then in the nonorthogonal case? 

The Case of Nonsignificant Interaction 

Of course, we can only know in a probabilistic sense if 
there is truly an interaction present in nature. We must in the 
final analysis rely on the results of statistical tests to direct 
us to reasonable models upon which to base our estimation procedure. 
This then leads us to ask what behavior is appropriate 
when the data dictate an interaction-free model and to consider 
the consequences of such behavior. There are, in this respect, 
only three cases which need concern us; in all three we will 
assume that the statistical test of interaction is nonsignificant. 

Case 1: No interaction in the population 

The first case we shall consider is the case when indeed 
there is, in nature, no interaction present. No empirical demonstration 
is needed to verify that if one has the form of the true 
linear model, the least squares estimates of the parameters in 
that model will be the best unbiased linear estimates. Furthermore, 
it is completely obvious that if one fits an interactive 
model when there is in fact no interaction, one will obtain 
unbiased estimates which will not be minimum variance. For this 
reason it is a mistake to include worthless effects in an ANOVA 
model, just as it would be to include worthless variables in a 
regression problem. The additional sampling error causes the 
main effect parameters and the estimated cell means to have larger 
standard errors than would the estimates from a main effects model. 
This point has been noted in a regression context by Walls and 
Weeks (1969) and is exactly what would occur if Overall and 
Spiegel's Method I were applied in this case. The increase in 
sampling error may be quite substantial and will result in less 
powerful tests of main effects. 

Case 2: Small but nonsignificant interaction effects 

The second case is the situation in which there is a true 
interaction but its magnitude is too small to be detected by a 
conventional test of interaction. We have previously argued that 
the main effect parameters are not meaningful for the interactive 
model, but that the predicted cell means are. The predicted cell 
means will have a smaller variance when estimated in the main 
effects model than when estimated in the interactive model, since 
the variance depends only upon the design matrix (X in the usual 
matrix approach to ANOVA) and the variance of the dependent 
variable. The predicted cell means will, however, be biased in 
this case. Since we can no longer speak of minimum variance 
unbiased estimators, it then becomes the mean square error which 
is relevant for comparison. We must add the mean squared bias 
(which will be a function of the magnitude of the small but nonzero 
interaction terms) to the variance to obtain the mean square error. 
This term will be small if the interactive effects are small, as 
would be the situation under Case 2. Operating under Case 2, we 
will still be estimating the same parameters and testing the same 
effects in both the orthogonal and nonorthogonal cases, but we 
will simply be estimating and testing with a small amount of bias. 
We will gain substantially in that the estimates will be more 
precise and the tests will be more powerful than if we followed 
Overall and Spiegel's Method I, which is based upon the interactive 
model. 



To see the difference, let us compare the variances of the 
estimated parameters and estimated cell means for the data used 
by Overall et al. In Table 10 we have computed the variances of 
the estimated main effect parameters and the predicted cell means 
for both the main effects model (our procedure) and the interactive 
model (Overall and Spiegel's Method I). If X is the matrix 
of independent variables, the variance-covariance matrix of the 
estimated parameters is $(X'X)^{-1}\sigma^2$, while the variance-covariance 
matrix of the predicted cell means is $X(X'X)^{-1}X'\sigma^2$. The variances 
are on the diagonals of these matrices and do not depend upon 
which model is correct in nature. Since $\sigma^2$ serves only as a scale 
factor, we have assumed in Table 10 that it is equal to one. We 
see that the estimated parameters of the main effects model have 
slightly smaller variances than those of the interactive model, 
while the corresponding predicted cell means have substantially 
smaller variances when estimated from the main effects model. 
The effect on the predicted cell means is particularly marked 
for the cells with a small number of observations, since the 
variance of a sample mean (the predicted value for an interactive 
model) is simply $\sigma^2/n$. 
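
The comparison of predicted-cell-mean variances can be sketched in a few lines. The code below is an illustration under stated assumptions, not the computation used for Table 10: the cell counts are inferred from the interactive-model column of that table (a variance of .167 suggests 6 observations and .333 suggests 3), the additive model is coded with reference-cell dummy variables, the error variance is set to one, and all function names are hypothetical. Exact fractions are used so that no numerical library is needed.

```python
from fractions import Fraction


def design_row(i, j, a, b):
    """Reference-cell coding for the additive model: intercept, A dummies, B dummies."""
    row = [1]
    row += [1 if i == r else 0 for r in range(1, a)]
    row += [1 if j == c else 0 for c in range(1, b)]
    return [Fraction(v) for v in row]


def solve(matrix, rhs):
    """Solve matrix * z = rhs exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [list(matrix[r]) + [rhs[r]] for r in range(n)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        lead = aug[col][col]
        aug[col] = [v / lead for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def cell_mean_variances(counts):
    """Variances of predicted cell means (error variance 1) under both models."""
    a, b = len(counts), len(counts[0])
    p = 1 + (a - 1) + (b - 1)
    xtx = [[Fraction(0)] * p for _ in range(p)]
    for i in range(a):
        for j in range(b):
            x = design_row(i, j, a, b)
            for r in range(p):
                for c in range(p):
                    xtx[r][c] += counts[i][j] * x[r] * x[c]
    main_effects = [[None] * b for _ in range(a)]
    interactive = [[None] * b for _ in range(a)]
    for i in range(a):
        for j in range(b):
            x = design_row(i, j, a, b)
            z = solve(xtx, x)                      # z = (X'X)^{-1} x
            main_effects[i][j] = sum(xv * zv for xv, zv in zip(x, z))
            interactive[i][j] = Fraction(1, counts[i][j])
    return main_effects, interactive


# Assumed cell counts for a 3 x 3 design (inferred, not given in the text).
counts = [[6, 3, 3], [3, 6, 3], [3, 6, 6]]
main_effects, interactive = cell_mean_variances(counts)
for i in range(3):
    for j in range(3):
        print(i + 1, j + 1,
              round(float(main_effects[i][j]), 3),
              round(float(interactive[i][j]), 3))
```

Two properties hold whatever the counts: each main-effects variance is no larger than the corresponding interactive variance, and the count-weighted sum of the main-effects variances equals the number of fitted parameters.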

Case 3: A large interaction which is not detected 

The third case covers the situation where a large interaction 

Table 10 

Variances of Parameter Estimates and Predicted Cell 
Means for 3x3 Factorial Design with Unequal Numbers 
of Observations Assuming $\sigma^2 = 1$ 

                          Main Effects Model    Interactive Model 

Parameters Estimated 

  $\mu$                          .161                  .169 
  $\alpha_1$                     .241                  .244 
  $\alpha_2$                     .235                  .244 
  $\beta$ (subscript illegible)  .224                  .231 

Estimated Cell Means 

  $Y_{11}$                       .116                  .167 
  $Y_{12}$                       .151                  .333 
  $Y_{13}$                       .158                  .333 
  $Y_{21}$                       .158                  .333 
  $Y_{22}$                       .112                  .167 
  $Y_{23}$                       .134                  .333 
  $Y_{31}$                       .151                  .333 
  $Y_{32}$                       .109                  .167 
  $Y_{33}$                       .112                  .167 

is somehow not detected by the interaction test. In this situation 
the reverse of Case 2 will occur and the mean square errors 
of the estimated parameters and cell means will be smaller for the 
interactive model. The probability of this third case occurring 
is, however, remote, for if the magnitude of the interaction 
effects is large and if the sample size is reasonable the power 
of the interaction test is quite large. 

We have shown above that the significance testing 
procedures which we have previously recommended for the nonorthogonal 
ANOVA are consistent with the basic principle advocated by 
Overall et al.; namely, that in the nonorthogonal case one should 
estimate the same parameters and test the same hypotheses that 
one would estimate and test if there were equal numbers of observations 
in the cells. Indeed, that principle is implicit in our 
original paper. We have pointed out that our method of fitting 
a series of main effect models (in the absence of interaction) 
is not the same as their Method II. We have further shown that 
their method for achieving the stated goal is incorrect and, if 
routinely applied, will not lead to optimal tests or estimates. 
In discussing the relationship between estimates and hypothesis 
testing we believe that we have made clear the reasons for pre- 
ferring our procedure since it leads to more powerful tests and 
more precise estimates. 

It must be recalled that the issues of how to estimate effects 
and how to test hypotheses are rather distinct. The methods 
discussed by Overall et al. and by us are methods for testing 
hypotheses and not for the estimation of effects. This distinction 
is not one which we uniquely make. Bock (1975), for instance, 
regards these as distinct processes. He begins with some initial 
model, performs tests of significance to determine if a simpler 
model is appropriate, and then estimates the parameters in the 
simplest reasonable model. We have seen no evidence which suggests 
that the methods advocated by Overall et al. are preferable. 
We continue to maintain, along with Rao and others, that one 
should test main effects, assuming no interaction to be present, 
when this is what is suggested by the data at hand. 

Chapter IV: A Comparison of Balancing and Other Methods of Adjustment 

Several alternative methods are available for adjusting for group 
differences in a dependent variable when the groups are not randomly 
constituted and thus may exhibit systematic differences on interfering 
variables that are related to the dependent variable. The best known of 
these methods is analysis of covariance. Other methods, based upon 
somewhat different assumptions, include direct and indirect standardization 
and balancing. 

Analysis of Covariance 

Analysis of covariance (see, e.g., Elashoff, 1969; Tatsuoka, 
1971) assumes that, in the population of interest, the 
i'th person in the j'th group has a score $Y_{ij}$ on the dependent 
variable that can be expressed as 

\[ Y_{ij} = \mu_j + \beta (X_{ij} - \bar{X}) + e_{ij} \]

(where $\mu_j = \mu + \alpha_j$). In this notation $\mu_j$ is the adjusted population 
mean on the dependent variable for the j'th group; $\beta$ 
is the within-group regression coefficient; $X_{ij}$ is the score 
on the interfering variable (the variable for which adjustment 
is made) for the i'th person in the j'th group; $\bar{X}$ is the mean 
score of the observations over all groups on the interfering 
variable; and $e_{ij}$ is an error term for the i'th person in the 
j'th group. The mean score on the dependent variable for the 
j'th group can be expressed as 

\[ \bar{Y}_j = \mu_j + \beta (\bar{X}_j - \bar{X}) + \bar{e}_j , \]

where $\bar{X}_j$ is the mean observed score on the interfering variable 
for group j and $\bar{e}_j$ is the mean error term for group j. 

For more than one interfering variable, a model of the 
following form is used: 

\[ \bar{Y}_j = \mu_j + \sum_m \beta^{(m)} \left( \bar{X}_j^{(m)} - \bar{X}^{(m)} \right) + \bar{e}_j , \]

where $\beta^{(m)}$ is the within-group multiple regression coefficient 
for the m'th interfering variable; $\bar{X}_j^{(m)}$ is the mean score for 
the j'th group on the m'th interfering variable; and $\bar{X}^{(m)}$ is 
the mean score over all groups on the m'th interfering variable. 

Least squares estimates for the parameters of the model 
are obtainable and parameter values can be replaced by these 
estimates to obtain an estimate of the adjusted mean score for 
the j'th group. 
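
For the single-covariate case, the substitution of estimates can be sketched as follows. This is a minimal illustration under assumptions: the slope is estimated by the usual pooled within-group least squares formula, the overall covariate mean is taken over all observations, the data are hypothetical and constructed so that the answer is known in advance, and the function name is not from the report.

```python
def ancova_adjusted_means(groups):
    """Adjusted group means for one covariate.

    groups maps a group label to a list of (x, y) pairs. The slope is the
    pooled within-group regression coefficient; each adjusted mean is the
    observed group mean of y minus slope * (group mean of x - overall mean of x).
    """
    all_x = [x for obs in groups.values() for x, _ in obs]
    grand_x = sum(all_x) / len(all_x)
    sxy = sxx = 0.0
    means = {}
    for label, obs in groups.items():
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        means[label] = (x_bar, y_bar)
        sxy += sum((x - x_bar) * (y - y_bar) for x, y in obs)
        sxx += sum((x - x_bar) ** 2 for x, _ in obs)
    slope = sxy / sxx
    adjusted = {label: y_bar - slope * (x_bar - grand_x)
                for label, (x_bar, y_bar) in means.items()}
    return slope, adjusted


# Hypothetical data generated without error from adjusted means 10 and 12
# and a within-group slope of 2 (assumed values, for illustration only).
groups = {
    "A": [(1.0, 6.0), (2.0, 8.0), (3.0, 10.0)],
    "B": [(3.0, 12.0), (4.0, 14.0), (5.0, 16.0)],
}
slope, adjusted = ancova_adjusted_means(groups)
print(slope, adjusted)
```

The observed group means here are 8 and 14; after adjustment for the covariate the difference between groups shrinks from 6 to 2.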

Balancing, Direct Standardization, and Indirect Standardization 

With one interfering variable, the model used in balancing, 
direct standardization, and indirect standardization is the 
additive analysis of variance model. This model assumes that 
in the population the i'th person in the j'th group with a 
score at the k'th level of an interfering variable has a score 
$Y_{ijk}$ that can be expressed as 

\[ Y_{ijk} = \mu + \alpha_j + \gamma_k + e_{ijk} , \]

or 

\[ Y_{ijk} = \mu_j + \gamma_k + e_{ijk} . \]

The mean score for persons in the j'th group and the k'th level 
of the interfering variable then can be expressed as 

\[ \bar{Y}_{jk} = \mu_j + \gamma_k + \bar{e}_{jk} . \]

In this notation $\mu_j$ is the adjusted mean (in the population) on 
the dependent variable for the j'th group and $\gamma_k$ is an effect 
associated with the k'th level of the interfering variable. 

With more than one interfering variable, balancing still 
may be employed, based upon the additive analysis of variance 
model. Direct standardization and indirect standardization 
usually are defined only for one interfering variable. However, 
each can be generalized to accommodate more than one interfering 
variable. The generalized model for either direct or indirect 
standardization also allows for all possible interactions among 
the interfering variables. For example, for three interfering 
variables, balancing employs a model of the form 

\[ \bar{Y}_{jklm} = \mu_j + \gamma_k + \delta_l + \theta_m + \bar{e}_{jklm} , \]

while generalized direct and indirect standardization are not restricted 
to an additive model and may use a model of the form 

\[ \bar{Y}_{jklm} = \mu_j + \gamma_{klm} + \bar{e}_{jklm} . \]

More will be said about generalized direct and indirect stand- 
ardization in a later section. 

While the analysis of covariance model (of the previous 
section) can treat interfering variables either as continuous 
or discrete, these analysis of variance models must treat all 
interfering variables as discrete, since each is to be categorized 
by level. However, measurement on a continuous variable 
must always result in observed values on a discrete variable, 
since the measurement process yields a finite set of 
possible values while the number of possible values of a continuous 
variable is indefinitely large (see Jones, 1971). 
Thus the operative distinction is that, with ANCOVA, an interfering 
variable may be measured with an indefinitely large 
number of score categories, while with balancing and standardization 
(direct and indirect), it is desirable that the number 
of score categories be limited. Results given by Cochran 
(1968a) suggest that only a slight loss of precision is associated 
with categorizing a continuous variable and then using 
standardization instead of using analysis of covariance with 
the original continuous variable as the covariate. Since we 
wish to compare ANCOVA, standardization (direct and indirect), 
and balancing, the remainder of this discussion assumes that 
variables are discrete, either because this was their original 
form or because they have been categorized. 

Balancing -- The technique of balancing was developed for 
the National Assessment of Educational Progress in an attempt 
to present estimates of educational achievement that are relatively 
uncontaminated by interfering variables (see National 
Assessment of Educational Progress, 1973). Appelbaum and 
Cramer (1975) have shown that the estimates of parameters from 
balancing are the least squares estimates from an additive analysis 
of variance model, obtained by solving the normal equations. 
The primary estimates of interest are the estimates of 
the adjusted mean scores, the $\mu_j$. 

There is a systematic relationship between estimates ob- 
tained by ANCOVA and balancing when interfering variables are 
discrete. To understand this relationship, it must be realized 
that nonorthogonal ANOVA estimates can be obtained by using 
orthogonal polynomial contrasts for each interfering variable 
as covariates in the ANCOVA model, since both ANOVA and ANCOVA 
are part of the general linear model (e.g., Bock, 1975, chap. 
5; Cohen, 1968). Analysis of covariance, however, uses only 
the linear trends of each factor as covariates, while balancing 
uses all trends of each factor as adjustment variates. 
Thus, estimates from the ANCOVA model are equivalent to those 
from a balancing model that assumes all trends other than the 
linear to be equal to zero. 

The choice between using balancing or ANCOVA involves a 
trade between bias and variability of the estimates. Parameter 
estimates obtained from a model are unbiased only if that 
model is valid in the population (Draper and Smith, 1966). 
Analysis of covariance, unlike balancing, requires that all 
nonlinear trends be zero in the population if its estimates 
are to be unbiased. Thus, estimates from ANCOVA are at least 
as biased as estimates from balancing. On the other hand, 
since the parameters of the ANCOVA model are a subset of those 
of the balancing model, and since both models may be conceptualized 
as regression models, the variance of balanced estimates 
must be at least as large as the variance of estimates from 
ANCOVA (see Walls and Weeks, 1969, for the general regression 
case). 

When it is certain that the relations between the interfering 
variables and the dependent variable are essentially 
linear, analysis of covariance is to be preferred to balancing, 
since estimates from both models are unbiased but those from 
the ANCOVA model are less variable. When relations are materially 
nonlinear, balancing is generally to be preferred to 
ANCOVA, but the magnitude of nonlinear trends and the sample 
size both should be considered. Estimates from balancing are 
less biased than those from analysis of covariance; the difference 
between the squared biases of the estimates from balancing 
and from ANCOVA depends upon the magnitude of nonlinear 
trends and is independent of sample size. However, the difference 
between the variances of the estimates from balancing 
and from ANCOVA is inversely proportional to sample size. 
With few observations, the difference between the variances is 
more likely to exceed the difference between the squared 
biases, in which case ANCOVA has the smaller mean-square error 
and on that basis ANCOVA is preferred to balancing. With a 
sufficient number of observations, however, the difference between 
the variances is unlikely to exceed the difference between 
the squared biases, in which case balancing, with a 
smaller mean-square error, is to be preferred. (An alternative 
under these conditions, not considered here, is a generalized 
ANCOVA, which adjusts for some subset of the nonlinear trends.) 
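
The trade-off described here amounts to comparing a variance gap that shrinks as 1/n with a squared-bias gap that stays fixed. The sketch below makes that comparison explicit; the two input constants are hypothetical placeholders (in practice they would depend on the design and on the size of the nonlinear trends), and the function names are assumptions.

```python
def preferred_method(n, variance_gap_at_n1, squared_bias_gap):
    """Pick the method with the smaller mean-square error at sample size n.

    variance_gap_at_n1: variance(balancing) - variance(ANCOVA), scaled to n = 1,
                        so the gap at sample size n is variance_gap_at_n1 / n.
    squared_bias_gap:   bias^2(ANCOVA) - bias^2(balancing), constant in n.
    Both inputs are assumed, illustrative quantities.
    """
    variance_gap = variance_gap_at_n1 / n
    return "ANCOVA" if variance_gap > squared_bias_gap else "balancing"


def crossover_sample_size(variance_gap_at_n1, squared_bias_gap):
    """Sample size at which the two mean-square errors are equal."""
    return variance_gap_at_n1 / squared_bias_gap


# Hypothetical constants chosen only to show the switch-over.
print(preferred_method(10, 10.0, 0.25))
print(preferred_method(100, 10.0, 0.25))
print(crossover_sample_size(10.0, 0.25))
```

When the nonlinear trends are exactly zero the squared-bias gap is zero, no finite crossover exists, and ANCOVA is preferred at every sample size, which matches the linear case discussed above.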

Direct standardization -- Direct standardization has been 
used extensively by demographers, biostatisticians, and health 
researchers desiring to adjust for interfering variables in 
the comparison of group effects. Basic references for direct 
standardization are Fleiss (1973) and Kalton (1968). An example 
of the use of direct standardization is presented by Moses 
(1969) in connection with the National Halothane Study, where 
the desire was to assess the effects of halothane and other 
anesthetics on death rates, taking account of the differential 
patient characteristics associated with the use of various 
anesthetics. 

With one interfering variable, direct standardization is 
based on the same model as balancing, but involves a different 
procedure for estimating parameters. The first step in this 
procedure is to estimate parameter differences of the form 
$\mu_j - \mu_{j'}$, where j and j' represent distinct groups. Direct 
standardization estimates this difference to be 

\[ \frac{\sum_k w_k \left( \bar{Y}_{jk} - \bar{Y}_{j'k} \right)}{\sum_k w_k} , \]

where the $w_k$ represent weights chosen by the experimenter. 

Kalton (1968) has shown that, when adjusting for one interfering 
variable and comparing the means of two groups, a minimum 
variance estimator of this difference (assumed to be constant 
for all k) is obtained by choosing weights such that 

\[ w_k = \frac{n_{1k}\, n_{2k}}{n_{1k} + n_{2k}} . \]




This derivation assumes equal variance within each cell of the 
design. While the assumption is unlikely to be valid for proportions, 
Kalton (1968, pp. 127-12[?]) shows that for proportions 
this choice of weights usually is adequate and sometimes is 
preferable to the use of weights obtained assuming unequal cell 
variances. Thus, the difference between the means of the two 
groups at a particular level of the interfering variable is 
weighted in inverse proportion to the variance of the difference 
between the means. Intuitively, when both groups are well 
represented at a particular level of the interfering variable, 
the variance of the difference between the group means will be 
relatively small, and that level of the interfering variable 
will receive a relatively large weight in the estimate of the 
adjusted difference between the group means. Kalton states, 
"If the $Y_{ijk}$ are normally distributed within subgroups, this 
model is the usual fixed effects analysis of variance model, 
with 

\[ \frac{\sum_k w_k \left( \bar{Y}_{1k} - \bar{Y}_{2k} \right)}{\sum_k w_k} \]

estimating a main effect" (Kalton, 1968, p. 123; the notation 
is changed to correspond with that used here). Direct standardization 
with these minimum-variance-producing weights will 
yield the same estimates as balancing, i.e., the estimates 
from an additive analysis of variance model, with or without 
a normal error distribution. 

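Snedecor and Cochran (1967) show that, in an additive 
two-factor analysis of variance model where one factor has two 
levels, the estimated differential effect for that factor is 
given by 

\[ \frac{\displaystyle \sum_k \frac{n_{1k}\, n_{2k}}{n_{1k} + n_{2k}} \left( \bar{Y}_{1k} - \bar{Y}_{2k} \right)}{\displaystyle \sum_k \frac{n_{1k}\, n_{2k}}{n_{1k} + n_{2k}}} , \]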

confirming that direct standardization with this choice of 
weights does produce the same estimates as the least-squares 
estimation procedure of an additive analysis of variance model. 
When both factors have more than two levels, however, this 
choice of weights does not in general yield the same estimates 
as the least-squares estimation procedure used with the addi- 
tive analysis of variance model. Thus, when more than two 
groups are to be compared, the weights presented by Kalton 
(1968) will not produce the same estimates as an additive ana- 
lysis of variance procedure. Intuitively, this can be under- 
stood by noting that a comparison of two of the groups by 
direct standardization completely ignores all other groups when 
adjusting for the effect of the interfering variable. In con- 
trast, the standard estimation method under the analysis of 
variance model uses all groups to estimate the effect of the 
interfering variable. 

The remainder of the discussion of this weighting procedure 
assumes that only two groups are to be compared. 

While direct standardization usually has been used only 
when one interfering variable is to be adjusted for, the procedure 
can be generalized to more than one interfering variable. 
For example, consider three interfering variables with K, L, 
and M levels. This design can be re-expressed as a design with 
one interfering variable with KxLxM levels. Direct standardization 
with minimum-variance-producing weights for this design 
will then yield the same estimates for adjusted group means as 
would the standard estimation methods for an analysis of variance 
model of the form 

\[ \bar{Y}_{jklm} = \mu_j + \gamma_{klm} + \bar{e}_{jklm} . \]

With one interfering variable (and only two groups) balancing 
and direct standardization give the same estimates. 
With more than one interfering variable, the estimates from 
balancing and direct standardization generally will differ. 
Direct standardization requires that there be at least one observation 
for every combination of the interfering variables, 
so that 

\[ n_{1klm} + n_{2klm} > 0 \]

for all k, l, and m; otherwise, estimates cannot be obtained 
in this model, because division by zero would be required. 
In this study, involving five interfering variables, direct 
standardization is not employed because this requirement fails 
to be satisfied. 

Another weighting procedure for direct standardization 
seems to be more widely used than that described by Kalton 
(1968). This procedure, described by Fleiss (1973), Cochran 
(1968a), and Moses (1969), uses weights of the form 
$w_k = \sum_j n_{jk} = n_{\cdot k}$, so that each level of the interfering 
variable is weighted by its total number of observations. 
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Both this weighting procediire and Kalton's (1968) yield luibi- 
ased estimates of the group effect in the analysis of variance 
model, but in general the variance of the estimate based upon 
■ this procedure is at least as large as the variance based upon 
Kalton s procedure. Intuitively, this procedure takes into 
account only the total number of observations for a level of 
the interfering variable; iy neglecting hov this total is dis- 
tributed for the different groups, an unstable estimate of the 
mean for a particular group at some level may receive a large 
veight with this procedure. This weighting procedure requires 
that no cell in the analysis of variance design be empty. For 
example, given three interfering variables with K, L, and M 
levels, this procedure requires that 

°Jk^m ^ ° 

for all J, k, Z, and m, a u-^re stringent requirement than for 
Kalton's procedure. Thus, this weighting procedure, although 
the one usually used, has no advantages over the weighting 
procedure presented by Kalton (1968;; Kalton's procedure does 
possess several advantages over thi:, alternative procedure. 
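
The contrast between the two weighting schemes can be sketched numerically. The example below is only an illustration: it assumes that Kalton's minimum-variance weights take the form n1k*n2k/(n1k + n2k), that all observations share a common within-cell variance, and it uses hypothetical cell counts and means that do not come from this report. The function names are likewise invented for the sketch.

```python
from fractions import Fraction

def kalton_weights(n1, n2):
    # Assumed minimum-variance form; defined whenever n1[k] + n2[k] > 0.
    return [Fraction(a * b, a + b) for a, b in zip(n1, n2)]

def total_count_weights(n1, n2):
    # Uses only the level totals, ignoring how they split between groups.
    return [Fraction(a + b) for a, b in zip(n1, n2)]

def standardized_difference(y1, y2, weights):
    # Weighted average of the within-level differences between the groups.
    return sum(w * (a - b) for w, a, b in zip(weights, y1, y2)) / sum(weights)

def variance_factor(n1, n2, weights):
    # Variance of the estimate in units of a common within-cell variance.
    # Requires every cell to be non-empty.
    num = sum(w * w * (Fraction(1, a) + Fraction(1, b))
              for w, a, b in zip(weights, n1, n2))
    return num / sum(weights) ** 2

# Hypothetical cell counts and cell means (not taken from the report).
n1, n2 = [1, 20, 9], [19, 20, 1]
y1, y2 = [30, 50, 70], [20, 45, 55]

w_kalton = kalton_weights(n1, n2)
w_total = total_count_weights(n1, n2)
diff_kalton = standardized_difference(y1, y2, w_kalton)
diff_total = standardized_difference(y1, y2, w_total)
var_kalton = variance_factor(n1, n2, w_kalton)
var_total = variance_factor(n1, n2, w_total)

print(float(diff_kalton), float(diff_total))
print(float(var_kalton), float(var_total))
```

In this hypothetical case the cells with a single observation receive weights of 20 and 10 under the total-count scheme but less than 1 under the assumed Kalton weights, and the variance factor of the total-count estimate is the larger of the two.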

Indirect standardization -- Indirect standardization has
been used even more extensively than direct standardization by
demographers, biostatisticians, and other medical researchers,
according to Fleiss (1973). The probable reason for the
greater usage of indirect standardization is that, unlike the
usual form of direct standardization (as presented by Fleiss,
1973), indirect standardization does not suffer from the problem
of assigning large weights to unstable cell means, and it
may be used even if a cell in the design is empty. An example
of the use of indirect standardization is again the National
Halothane Study, discussed by Moses (1969).

With one interfering variable, indirect standardization
is based on the same model as balancing and direct
standardization, but employs a different method for estimating
parameters. There are two approaches to indirect standardization,
both of which have been developed only for one interfering
variable. However, for more than one interfering variable, a
generalized indirect standardization procedure can be defined
in a manner analogous to that for direct standardization,
where the design is re-expressed as a design with one
interfering variable. The following discussion assumes there to be
only one interfering variable, either because this is the
original design or because the original multivariable design
has been re-expressed.

One approach is that given by Wiley (1973). Whereas balancing
obtains estimates by solving the normal equations to arrive
at a simultaneous fit to the data, indirect standardization as
defined by Wiley obtains estimates in a two-step process. First,
estimates of the effect associated with each level of the
interfering variable are obtained in a model that assumes no main
effect to be associated with group (i.e., the model assumes
$\alpha_j = 0$ for all j); the model thus is of the form

$$Y_{ijk} = \mu + \gamma_k + e_{ijk}.$$

Least squares estimates are obtained for $\mu$ and $\gamma_k$. These
estimates are given by

$$\hat{\mu} = G \quad \text{(the weighted grand mean)},$$

and

$$\hat{\gamma}_k = \bar{Y}_{\cdot k} - G.$$

Second, these estimates are used to obtain the estimated adjusted
score for the j'th group in the model with two main effects.
It now will be shown that when $\gamma_k$ and $\mu$ are constrained to
$\hat{\gamma}_k$ and G respectively, the least squares estimate for
$\mu_j = \mu + \alpha_j$ is given by

$$\hat{\mu}_j = G + \frac{\sum_k n_{jk} (\bar{Y}_{jk} - \bar{Y}_{\cdot k})}{n_j},$$

the estimated adjusted score of Wiley's approach. The squared
error for the j'th group is given by

$$\sum_k \sum_i (Y_{ijk} - \mu - \alpha_j - \gamma_k)^2,$$

which equals

$$\sum_k \sum_i (Y_{ijk} - \bar{Y}_{jk})^2 + \sum_k n_{jk} (\bar{Y}_{jk} - \mu - \alpha_j - \gamma_k)^2.$$

Substituting $\gamma_k = \hat{\gamma}_k$ and $\mu = G$ and setting the
derivative with respect to $\alpha_j$ equal to zero,

$$\sum_k n_{jk} (\bar{Y}_{jk} - \bar{Y}_{\cdot k} - \hat{\alpha}_j) = 0,$$

so that

$$G + \frac{\sum_k n_{jk} (\bar{Y}_{jk} - \bar{Y}_{\cdot k})}{n_j} = \hat{\mu}_j.$$

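To compare Wiley's formula with those for balancing and
direct standardization, we can compute the estimated group effect,

$$\hat{\alpha}_j = \hat{\mu}_j - G = \frac{\sum_k n_{jk} (\bar{Y}_{jk} - \bar{Y}_{\cdot k})}{n_j}.$$

When comparing two groups (the case for which adjustment
techniques are most often used), indirect standardization as
defined by Wiley provides the following estimate of
$\alpha_1 - \alpha_2$:

$$\hat{\alpha}_1 - \hat{\alpha}_2 = \frac{\sum_k n_{1k} (\bar{Y}_{1k} - \bar{Y}_{\cdot k})}{n_1} - \frac{\sum_k n_{2k} (\bar{Y}_{2k} - \bar{Y}_{\cdot k})}{n_2}.$$

Since

$$\bar{Y}_{\cdot k} = \frac{n_{1k} \bar{Y}_{1k} + n_{2k} \bar{Y}_{2k}}{n_{1k} + n_{2k}},$$

it follows that

$$\hat{\alpha}_1 - \hat{\alpha}_2 = \sum_k \frac{(n_1 + n_2)\, n_{1k} n_{2k} (\bar{Y}_{1k} - \bar{Y}_{2k})}{n_1 n_2 (n_{1k} + n_{2k})}.$$

This can be rewritten as

$$\hat{\alpha}_1 - \hat{\alpha}_2 = \left( \frac{n_1 + n_2}{n_1 n_2} \right) \left( \sum_k \frac{n_{1k} n_{2k}}{n_{1k} + n_{2k}} \right) \left( \frac{\sum_k \frac{n_{1k} n_{2k}}{n_{1k} + n_{2k}} (\bar{Y}_{1k} - \bar{Y}_{2k})}{\sum_k \frac{n_{1k} n_{2k}}{n_{1k} + n_{2k}}} \right).$$

But the third expression in parentheses is the estimate for
balancing. Thus, Wiley's estimate of the group effect equals the
balancing estimate (and thus also equals the estimate from direct
standardization) if and only if

$$\frac{n_1 n_2}{n_1 + n_2} = \sum_k \frac{n_{1k} n_{2k}}{n_{1k} + n_{2k}},$$

that is, if and only if

$$\frac{\left( \sum_k n_{1k} \right) \left( \sum_k n_{2k} \right)}{\sum_k (n_{1k} + n_{2k})} = \sum_k \frac{n_{1k} n_{2k}}{n_{1k} + n_{2k}}.$$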
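
The factorization above can be checked with exact arithmetic. The following is a minimal sketch, not part of the original analysis: the function names are invented here, and both sets of cell counts and means are hypothetical. It confirms that Wiley's two-group difference equals the balancing estimate multiplied by the product of the first two factors, and shows one set of counts (proportional frequencies in the two groups) for which that product is exactly 1.

```python
from fractions import Fraction

def harmonic_weights(n1, n2):
    # n1k * n2k / (n1k + n2k) for each level k.
    return [Fraction(a * b, a + b) for a, b in zip(n1, n2)]

def balancing_difference(n1, n2, y1, y2):
    w = harmonic_weights(n1, n2)
    return sum(wk * (a - b) for wk, a, b in zip(w, y1, y2)) / sum(w)

def wiley_difference(n1, n2, y1, y2):
    # Pooled mean at each level, ignoring group membership.
    pooled = [Fraction(a * p + b * q, a + b)
              for a, b, p, q in zip(n1, n2, y1, y2)]
    eff1 = sum(a * (p - m) for a, p, m in zip(n1, y1, pooled)) / sum(n1)
    eff2 = sum(b * (q - m) for b, q, m in zip(n2, y2, pooled)) / sum(n2)
    return eff1 - eff2

def scale_factor(n1, n2):
    # Product of the first two factors in the rewritten expression.
    big1, big2 = sum(n1), sum(n2)
    return Fraction(big1 + big2, big1 * big2) * sum(harmonic_weights(n1, n2))

# Hypothetical counts and means with unequal distributions over levels.
n1, n2 = [4, 6, 10], [10, 6, 4]
y1, y2 = [12, 15, 21], [10, 14, 17]
lhs = wiley_difference(n1, n2, y1, y2)
rhs = scale_factor(n1, n2) * balancing_difference(n1, n2, y1, y2)

# Hypothetical counts that are proportional across the two groups.
proportional_factor = scale_factor([2, 4, 6], [1, 2, 3])

print(lhs == rhs, float(scale_factor(n1, n2)), proportional_factor)
```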

Another approach to indirect standardization, discussed by
Fleiss (1973), involves multiplication of the grand mean
estimate by the estimated mean unadjusted score of the j'th group,
and division of this product by the mean score that the j'th
group would have received if its mean score on the dependent
variable within each level of the interfering variable had been
the same as that of the population. This approach uses the
same estimates as Wiley's approach but combines them in a
multiplicative rather than an additive fashion. It yields an
estimated adjusted score for the j'th group that can be written as

$$\hat{\mu}_j = G \, \frac{\frac{1}{n_j} \sum_k n_{jk} \bar{Y}_{jk}}{\frac{1}{n_j} \sum_k n_{jk} \bar{Y}_{\cdot k}}.$$

An evaluation of indirect standardization -- Analysis of
covariance, balancing, and direct standardization have been
compared and contrasted, with a discussion of the relative
merits of each. Still to be discussed is the relative value
of each of those approaches compared with indirect
standardization when adjusting for only one interfering variable.



TABLE I

Number of Subjects and Mean Score by Group
Within Each Level of the Interfering Variable

Level of
Interfering
Variable       Group 1      Group 2      Marginal

    1          N=1          N=1          N=2
               Y=20         Y=10         Y=15

    2          N=1          N=2          N=3
               Y=40         Y=30         Y=33.3

    3          N=2          N=3          N=5
               Y=60         Y=50         Y=54

    4          N=9          N=4          N=13
               Y=80         Y=70         Y=77

    5          N=2          N=5          N=7
               Y=100        Y=90         Y=92.9

Marginal       N=15         N=15         N=30
               Y=73.3       Y=63.3       Y=68.3


Two examples are cited to illustrate certain features of
the two alternative estimation procedures for indirect
standardization. First, consider the fictitious data given in
Table I. Within each group, there is a perfect linear
relationship between the interfering variable and the dependent
variable. Also, at each level of the interfering variable,
Group 1 has a mean score 10 points higher on the dependent
variable than does Group 2. Note also that the frequency
distribution of scores on the interfering variable is quite
different for the two groups.

TABLE II

Adjusted Mean Scores from Table I

              Group 1    Group 2    Difference

Wiley          72.71      63.96        8.75

Fleiss         72.67      63.92        8.75

ANCOVA         73.33      63.33       10

Balancing      73.33      63.33       10



Adjusted mean scores for the two groups as derived from
each adjustment technique are given in Table II. Analysis of
covariance and balancing both estimate the difference between
adjusted scores to be 10, while either method of indirect
standardization estimates the difference to be slightly less
than 10. But the evidence is that, within any level of the
interfering variable, the difference is in fact 10, so this is
the desired difference between the adjusted scores. Analysis
of covariance and balancing both recover this difference, in
contrast to both forms of indirect standardization.
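
The two indirect standardization estimates can be recomputed from the cell counts and cell means of Table I. The sketch below is an illustration under stated assumptions: the function name and data layout are invented here, and the formulas follow the additive (Wiley) and multiplicative (Fleiss) forms described in the text, with the expected score of a group taken as the count-weighted average of the pooled level means.

```python
def indirect_standardization(counts, means):
    """Return (wiley, fleiss) adjusted scores, one entry per group.

    counts[j][k] and means[j][k] hold the cell size and cell mean for
    group j at level k of the interfering variable. Every level must
    contain at least one observation.
    """
    groups = range(len(counts))
    levels = range(len(counts[0]))
    n_level = [sum(counts[j][k] for j in groups) for k in levels]
    # Pooled mean at each level, ignoring group membership.
    level_mean = [sum(counts[j][k] * means[j][k] for j in groups) / n_level[k]
                  for k in levels]
    grand = sum(n_level[k] * level_mean[k] for k in levels) / sum(n_level)

    wiley, fleiss = [], []
    for j in groups:
        n_j = sum(counts[j])
        observed = sum(counts[j][k] * means[j][k] for k in levels) / n_j
        expected = sum(counts[j][k] * level_mean[k] for k in levels) / n_j
        wiley.append(grand + observed - expected)    # additive form
        fleiss.append(grand * observed / expected)   # multiplicative form
    return wiley, fleiss

# Cell counts and means as read from Table I.
counts = [[1, 1, 2, 9, 2], [1, 2, 3, 4, 5]]
means = [[20, 40, 60, 80, 100], [10, 30, 50, 70, 90]]

wiley, fleiss = indirect_standardization(counts, means)
print([round(x, 2) for x in wiley])
print([round(x, 2) for x in fleiss])
```

Rounded to two decimals, both estimates give a between-group difference of 8.75 rather than the within-level difference of 10.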

As a general rule, analysis of covariance will recover the
desired difference whenever there is no interaction between the
interfering variable and the grouping and all higher-order
trends are zero in the data. Balancing will recover this
desired difference whenever there is no interaction between the
interfering variable and the grouping. When there is such an
interaction, the mean score difference between groups varies
depending on the particular level of the interfering variable,
so there is varying evidence on the difference between the
groups. In general, indirect standardization will not recover
this desired difference.



A second example shows another way in which either
approach to indirect standardization may yield misleading results.
Consider the data in Table III. For these data, Group 1 has a
mean score of 60 at every level of the interfering variable,
while Group 2 has a mean score of 40 at every level. In
addition, there is little overlap between the two groups on the
interfering variable; only for level 3 are there observations
for both groups, and here, as elsewhere, Group 1 has an average
score of 60 while Group 2 has a mean score of 40. It seems
reasonable to conclude that within groups the interfering
variable and the dependent variable are unrelated; instead, Group 1
members tend to score higher than Group 2 members on both the
interfering variable and the dependent variable.

TABLE III

Number of Subjects and Mean Score by Group
Within Each Level of the Interfering Variable

Level of
Interfering
Variable       Group 1      Group 2      Marginal

    1          N=0          N=50         N=50
                            Y=40         Y=40

    2          N=0          N=25         N=25
                            Y=40         Y=40

    3          N=25         N=25         N=50
               Y=60         Y=40         Y=50

    4          N=25         N=0          N=25
               Y=60                      Y=60

    5          N=50         N=0          N=50
               Y=60                      Y=60

Marginal       N=100        N=100        N=200
               Y=60         Y=40         Y=50



Both analysis of covariance and balancing support this
conclusion, as seen in Table IV. The adjusted score for each
group equals the unadjusted score for the group. Indirect
standardization, however, gives the impression that the
difference between the mean scores of the groups can be explained by
their being different on the interfering variable. The adjusted
rates obtained from indirect standardization are very nearly
equal for the two groups.

TABLE IV

Adjusted Mean Scores from Table III

              Group 1    Group 2    Difference

Wiley          52.50      47.50        5.00

Fleiss         52.17      47.06        5.11

ANCOVA         60         40          20

Balancing      60         40          20

Adjusted mean scores obtained from balancing will differ
from unadjusted mean scores if and only if the interfering
variable is related to the dependent variable within each group
(homogeneity of the relation is assumed) and the groups have
different means on the interfering variable. For analysis of
covariance, there is the additional restriction that adjusted
scores will differ from unadjusted scores only to the extent
of linear relationship. Indirect standardization, on the other
hand, may give adjusted scores that are different from the
unadjusted scores despite groups having the same mean on the
interfering variable (see Tables I and II), or despite there
being no (within group) relation between the interfering
variable and the dependent variable (see Tables III and IV).
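
The Table III case, with its empty cells, can be worked through in a few lines. This is a sketch under assumptions rather than the report's own computation: the balancing difference is taken in its two-group, one-variable form with weights n1k*n2k/(n1k + n2k), so that levels where one group is absent receive zero weight, and the value 0 stands in as a placeholder mean for empty cells.

```python
# Table III: counts[j][k] and means[j][k] for group j and level k.
counts = [[0, 0, 25, 25, 50], [50, 25, 25, 0, 0]]
means = [[0, 0, 60, 60, 60], [40, 40, 40, 0, 0]]  # 0 = placeholder, empty cell

levels = range(5)
n_level = [counts[0][k] + counts[1][k] for k in levels]
level_mean = [(counts[0][k] * means[0][k] + counts[1][k] * means[1][k]) / n_level[k]
              for k in levels]
grand = sum(n * m for n, m in zip(n_level, level_mean)) / sum(n_level)

def observed(j):
    # Unadjusted mean of group j.
    return sum(c * m for c, m in zip(counts[j], means[j])) / sum(counts[j])

def expected(j):
    # Mean group j would have if it scored at the pooled level means.
    return sum(c * m for c, m in zip(counts[j], level_mean)) / sum(counts[j])

wiley = [grand + observed(j) - expected(j) for j in (0, 1)]
fleiss = [grand * observed(j) / expected(j) for j in (0, 1)]

# Balancing difference: only levels containing both groups carry weight.
weights = [counts[0][k] * counts[1][k] / n_level[k] for k in levels]
balancing_diff = sum(w * (means[0][k] - means[1][k])
                     for k, w in zip(levels, weights)) / sum(weights)

print([round(x, 2) for x in wiley], [round(x, 2) for x in fleiss], balancing_diff)
```

Only level 3 contributes to the balancing difference, which stays at 20, whereas the pooled level means used by indirect standardization absorb most of the group difference.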

With reference to admission data, let us consider an
example of how either approach to indirect standardization may be
misleading. Suppose that in Table III, Group 1 represents male
applicants, Group 2 represents female applicants, and the
interfering variable is height. In these hypothetical data, male
applicants all have heights which place them in level 3 or 4
or 5, while female applicants all are placed in level 1 or 2
or 3. Male applicants have an average admission rate of 60
percent regardless of their height, while female applicants
have an average of 40 percent regardless of their height.
There is no relation between height and admission within sex,
i.e., an admission committee does not act on the basis of an
applicant's height. Thus, in all probability, if the average
height of women were to increase, their admission rate would
stay the same. But indirect standardization leads us to
believe that if women were only taller, they would be accepted
at almost the same rate as men. From the results of indirect
standardization, it would seem that males and females are being
accepted at nearly the same rate, once we take into account the
difference in average height. But results obtained from
balancing and analysis of covariance will yield an adjusted
admission rate for male applicants of 60 percent and a rate for
females of 40 percent, implying that the admission rate for males
would remain substantially higher than the rate for females
even if the average height for female applicants were to
increase.

The example given in Table III is an extreme case
illustrating a possible difference between results obtained by
indirect standardization and results from balancing and analysis
of covariance. Both of the latter techniques rely on the
relation between the interfering variable and the dependent variable
within each group. Such a relation should be found (except for
chance error) if and only if the interfering variable is exerting
an influence on the decisions of the admission committee, in
which case it is reasonable to predict that if a group's mean
score on the interfering variable were higher, the group's
admission rate also would be higher. When it is desired that
adjustment be made only for such a "within-group" relation,
the use of either balancing or analysis of covariance is always
preferable to the use of indirect standardization.

In general, indirect standardization seems to offer no
advantages over balancing, but seems to suffer from several
disadvantages. The only advantage indirect standardization
has over analysis of covariance is that it allows for a
nonlinear relation between the interfering and the dependent
variable, but balancing also makes this allowance. Thus, it seems
that either balancing or analysis of covariance should be used
to obtain adjusted scores.

Chapter V: "Smear and Sweep" Analysis

One of the secondary objectives of this grant was the investigation of
other data analytic techniques used to adjust for nuisance confounding in the
NAEP studies. One such technique (and the only other "non-standard" technique
of major consequence) is that known as smear-and-sweep. The following chapter
gives the basic results on smear-and-sweep and its relation to balancing and
the nonorthogonal analysis of variance.

In many behavioral or social research situations, researchers may want
to estimate treatment effects or the relationships between input and output
variables, while controlling for a number of extraneous variables. Some methods
of analysis use the input and the extraneous variables to form multifactor
classifications, and then estimate the treatment effects or relationships,
adjusted for the effects of extraneous variables. A large number of variables
available in the data may thus be selected to form multifactor
cross-classifications, resulting in few observations per cell; indeed, some
cells in the cross-classification may have no observations. For example, if
2500 sixth-grade students are involved in a study of educational progress, and
later stratified into subgroups by region of the country (four levels), sex,
race (three levels), type of community (seven levels), and parental education
(five levels), they will be distributed over 840 cell combinations, giving an
average of about three observations per cell. With this many cells the data in
each cell become too sparse to allow stable estimates of cell values, and
direct control on all the extraneous variables by multifactor classification
may, therefore, be impractical.
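
The arithmetic behind the sparse-cell example can be sketched as follows. The dictionary keys are labels chosen here for readability, and sex is assumed to have two levels, which the text implies but does not state.

```python
from math import prod

# Number of levels for each stratifying variable named in the example.
levels = {
    "region": 4,
    "sex": 2,               # assumed: the text does not give the count
    "race": 3,
    "community type": 7,
    "parental education": 5,
}

n_students = 2500
n_cells = prod(levels.values())
mean_per_cell = n_students / n_cells

print(n_cells, round(mean_per_cell, 2))
```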

Smear and Sweep Analysis

One method developed for the sparse data resulting from the after-the-fact
classifications is the smear-and-sweep analysis. This method first appeared in
the report of the National Halothane Study (Gentleman, Gilbert & Tukey, 1969)
in which the death rate of the patients in operations using Halothane was
examined. It was later considered by the National Assessment of Educational
Progress (NAEP) as a possible method to obtain sharper subpopulation weights
(see Ahmann, 1973, pp. 108-109).

Smear-and-sweep is a method that first pools the cells of the control
variables in a cross-classification table into categories in which the cell
values (e.g., proportions, ratios, means) are reasonably similar, and then
calculates and compares "effects" for an independent variable of interest
across a final set of categories. The basic strategy is to form a two-way
table on two of the control variables at a time. (This step is referred to
as smearing.) The cells of this table are then ordered on a single dimension
according to the value of the dependent variable in the cells. The value of
this dependent variable may be simply the observed data, least squares
estimates, or statistically adjusted values. Then the adjacent cells are
pooled into a smaller number of categories which define the levels of a new
conglomerate variable. (This step is referred to as sweeping.) The process is
then repeated by forming another classification table consisting of the newly
formed conglomerate variable and another control variable. This process is
continued until only a single conglomerate variable remains. The final
conglomerate variable is then cross-classified with the independent variable
of interest. This table is then used to compute marginal estimates and perform
some comparisons among the levels of the independent variable of interest,
using classical techniques such as analysis of variance.

The essence of this method is that it permits the researcher to handle
many known and available variables as control variables. The process of
smear-and-sweep will presumably control or minimize the effects of extraneous
variables, and thus allows better estimates of the effects due to the
independent variable of interest. By using two variables at a time, the number
of observations in each cell combination may be large enough for stable
estimates.

An Illustration

Smear-and-sweep analysis is illustrated by the following hypothetical
data set. A probability sample of 1,933 high school graduates was given a
science test. Their test scores were scored either 1 (pass) or 0 (fail).
Suppose that, using this data set, a researcher was interested in testing
the effect of ethnic group on the students' test scores after controlling for
sex, region, socioeconomic status (SES), and high school curricular program
(HSP). The researcher could use a multifactor design, and apply analysis of
variance to obtain adjusted marginal estimates for ethnic groups; however, in
so doing, the observations within each cell combination would be very sparse.
Many cells would have two or three observations while some other cells would
have none; the cell sizes might not be sufficient to provide stable estimates.

Given these problems, the researcher chose to use a more "data analytic"
approach, namely smear-and-sweep. The researcher first cross-classified
students on the basis of their socioeconomic background and high school
program (HSP). SES had three categories: high, middle, and low, corresponding
to the upper quartile, middle two quartiles, and lower quartile of the SES
composite scores, respectively. HSP was defined by college preparatory
(academic), general, and vocational-technical (voc-tech) programs. The
proportion of pass for each cell combination was computed as follows:

$$P_{ij} = \frac{S_{ij}}{N_{ij}},$$

where $S_{ij}$ is the number of students who had a score of 1 in the i'th SES
and j'th HSP, and $N_{ij}$ is the total number of respondents in this cell.

The obtained proportions were then ordered, and their corresponding cells
were grouped into five categories as indicated in Table 1. The criterion for
grouping was that the range of proportions in each category should not exceed
.05. These five categories comprised a new "conglomerate" variable, each
category incorporating some "effects" due to SES and HSP.

Table 1

Proportion of Pass for SES and HSP
Cross-Classification Groups

             High School    Proportion
SES          Program        of Pass       Category

High         Academic          .82           1
Middle       Academic          .67           2
High         General           .56           3
Low          Academic          .55           3
High         Voc-Tech          .34           4
Middle       General           .32           4
Middle       Voc-Tech          .22           5
Low          General           .22           5
Low          Voc-Tech          .17           5
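
The sweeping step for Table 1 can be sketched as a simple greedy rule. This is an assumption made for illustration: the text states only the pooling criterion (a within-category range of at most .05), not the exact procedure, and the function name `sweep` is invented here. Proportions are held as integer hundredths so that the range comparison is exact.

```python
def sweep(values, max_range):
    """Assign category labels to cell values ordered from high to low.

    A new category is opened whenever adding the next cell would make
    the range within the current category exceed max_range.
    """
    ordered = sorted(values, reverse=True)
    labels = []
    label, start = 0, None
    for v in ordered:
        if start is None or start - v > max_range:
            label += 1      # open a new category
            start = v       # its highest value anchors the range
        labels.append(label)
    return ordered, labels

# Proportions of pass from Table 1, in hundredths.
proportions = [82, 67, 56, 55, 34, 32, 22, 22, 17]
ordered, categories = sweep(proportions, max_range=5)
print(list(zip(ordered, categories)))
```

Under this rule the nine cells fall into the five categories listed in Table 1.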



The newly formed conglomerate variable was then cross-classified with
four geographic regions, resulting in a four-by-five classification table.
The cell values (i.e., proportions) of this table were calculated, and the
cells were grouped in accordance with the rules described previously. The
results are presented in Table 2. As seen in the table, the new conglomerate
variable included eight categories, as indicated by the numbers in parentheses.

Table 2

Proportion of Pass for Region and the
First Conglomerate Variable Combination Group

                              Conglomerate Variable
Region              1         2         3         4         5

Northeast        .84(1)*   .70(2)    .60(3)    .26(7)    .26(8)
North Central    .82(1)    .65(2)    .52(5)    .30(7)    .21(8)
South            .82(1)    .64(3)    .56(4)    .36(6)    .19(8)
West             .80(1)    .67(2)    .58(4)    .34(6)    .30(7)

*The figures in parentheses denote the levels of the newly formed variable.

Similarly, the second newly formed variable was then crossed with two sex
groups to form a two-by-eight classification table. The proportion of pass in
each cell combination is presented in Table 3. Again, the proportions were
ordered, and their corresponding cells were grouped into categories, as
indicated by the numbers in parentheses, based upon the same criterion that the
range of proportions in each category should not exceed .05.



Table 3

Proportion of Pass for Sex and the
Second Conglomerate Variable Combination Group

                                  Conglomerate Variable
Sex         1        2        3        4        5        6        7        8

Male     .82(1)*  .69(2)   .62(3)   .53(4)   .53(4)   .34(5)   .30(6)   .20(7)
Female   .83(1)   .67(2)   .64(2)   .60(3)            .36(5)   .28(6)   .18(7)

*The figures in parentheses denote the levels of the newly formed variable.



The last newly formed conglomerate variable was then cross-classified
with ethnic group, which was the independent variable of interest. There were
four ethnic groups: black, white, Hispanic (Spanish American), and others. The
resulting four-by-seven table and its cell values are presented in Table 4.
The last column of the table presents the adjusted average of cell proportions
for each ethnic group. No substantial differences among ethnic groups were
revealed, although whites had a slightly lower proportion than other groups.
It should be noted, however, that these adjusted estimates were quite different
from unadjusted ones. Had the proportions been estimated without controlling
for sex, region, SES and HSP, the estimates would have been .36, .47, .38, and
.42 for blacks, whites, Hispanics and others, respectively. Whites would have
had a much higher proportion of pass than blacks.

Table 4

Proportion of Pass for Race and the
Third Conglomerate Variable Combination Group

Ethnic                     Conglomerate Variable                  Adjusted
Group         1      2      3      4      5      6      7         Average

Black        .87    .66    .59    .38           .43    .23          .52
White        .82    .67    .61    .55    .34    .28    .17          .49
Hispanic     .79    .65    .60    .57    .40    .29    .24          .51
Other        .89    .74    .65    .51    .39    .25    .18          .52
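
The adjusted averages in the last column appear consistent with an unweighted mean of the seven cell proportions in each row. The sketch below checks this reading for the rows whose seven cells are all available; treating the adjusted average as a simple mean is an inference from the numbers, not a definition given in the text.

```python
# Cell proportions across the seven conglomerate categories (Table 4).
rows = {
    "White":    [.82, .67, .61, .55, .34, .28, .17],
    "Hispanic": [.79, .65, .60, .57, .40, .29, .24],
    "Other":    [.89, .74, .65, .51, .39, .25, .18],
}

# Unweighted mean over categories, so each category counts equally
# regardless of how many students of a given group fall into it.
adjusted = {group: round(sum(p) / len(p), 2) for group, p in rows.items()}
print(adjusted)
```

Because every category receives equal weight, a group concentrated in low-scoring categories is not penalized for that concentration, which is why these figures differ from the unadjusted proportions.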



It is seen that the entire process of smear-and-sweep requires the
selection of classification variables and of the following guidance functions:
(1) the order in which the classification variables are to be presented in the
analysis, and (2) the criterion for cell pooling. The pooling criterion may
be that each category contains (1) an approximately equal number of pass or
fail, (2) an equal number of sample members, (3) equal variance of estimated
cell values (Gentleman, Gilbert, & Tukey, 1969, p. 289), or (4) equal range
of cell values. Once the guidance functions are sufficiently determined, the
computational procedures become straightforward.

It should be noted that the cell values in the previous cross-classification
tables were estimated simply by using the observed data. Other estimating
procedures are possible. For example, one might use the formula

$$P_{ij} = \frac{\sum_n W_{ijn} X_{ijn}}{\sum_n W_{ijn}},$$

where $W_{ijn}$ is the sample weight for the n'th individual in the ij'th cell,
and $X_{ijn}$ is the individual's score, either 1 or 0, 1 being pass, 0 being
fail.
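
A weighted cell proportion of this kind can be sketched in a few lines. The function name and the sample weights below are hypothetical and serve only to show that the weighted form reduces to the simple proportion of passes when all weights are equal.

```python
def weighted_proportion(weights, scores):
    """Weighted proportion of pass within one cell.

    weights: sample weight of each individual in the cell.
    scores:  1 (pass) or 0 (fail) for the same individuals.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("cell has no positive sample weight")
    return sum(w * x for w, x in zip(weights, scores)) / total

# Hypothetical cell with unequal sample weights.
p_weighted = weighted_proportion([1.0, 2.0, 1.0], [1, 0, 1])

# With equal weights the result is the plain proportion of passes.
p_equal = weighted_proportion([1, 1, 1, 1], [1, 1, 0, 1])

print(p_weighted, p_equal)
```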
Some Considerations in Smear-and-Sweep Analysis

Although smear-and-sweep has been applied to the analysis of the National
Halothane Study (Gentleman, Gilbert, & Tukey, 1969), no proof of the stability
and accuracy of estimation has been given. Many questions involving the choice
of guidance functions, such as the number of categories and the order of the
classification variables introduced into the process, are unanswered. Among
such questions, the following ones are considered critical:

1. Does the number of categories selected affect the stability and 
accuracy of the estimates? 

2. Does the order of treating the interfering variables affect the 
estimation of the effects of the independent variable? 

3. How do the results obtained by smear-and-sweep differ from those
obtained by classical ANOVA?

To answer these questions, three sets of hypothetical data were constructed.
Each set of data was derived by using the following four-factor main-effect
model:

$$Y_{ijklm} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_l + e_{ijklm},$$

in which $\mu = 0$, $\sum \alpha_i = \sum \beta_j = \sum \gamma_k = \sum \delta_l = 0$,
and $e_{ijklm} \sim N(0, 1)$.
This additive model was selected for its simplicity. If smear-and-sweep does
not work in such a simple model, it will very likely fail in a more complicated
non-additive model.

In the four-factor model, the first factor (independent variable), denoted
by A, is the variable of interest. A has two levels; thus, the estimates of
effects for $A_1$ and $A_2$ are of main concern. The other three factors,
designated by B, C, and D, respectively, are referred to as interfering
variables. All these variables are assumed to be associated with the dependent
variable.

The error components for observations in each cell were chosen to be
normally distributed with a mean of 0 and a standard deviation of 1, and were
generated using a standard method (Box & Muller, 1958). The main effects for
all factors in the analysis were fixed at the values presented in Table 5.
These values represent differences ranging from four to one-tenth standard
deviations apart.

Table 5

Main Effects Selected for Each Set of Data

                                   Data Set
Effect         Level         I         II        III

$\alpha_i$       1          2.00      1.50      1.00
                 2         -2.00     -1.50     -1.00

$\beta_j$        1         -1.50     -1.00      -.50
                 2          1.50      1.00       .50

$\gamma_k$       1          1.00       .50       .10
                 2         -1.00      -.50      -.10

$\delta_l$       1           .50       .10       .05
                 2          -.50      -.10      -.05
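
The construction of one hypothetical data set can be sketched as below. This is an illustration under stated assumptions, not the authors' program: it uses the Data Set I effects of Table 5, the cell frequencies as read from Table 6, a grand mean of zero, and a hand-written cosine form of the Box-Muller transform driven by Python's own uniform generator. The function and variable names are invented here, and the simulated values will not match the report's data.

```python
import math
import random

# Data Set I main effects (level 1, level 2) for factors A, B, C, D.
EFFECTS_SET_I = {"A": (2.00, -2.00), "B": (-1.50, 1.50),
                 "C": (1.00, -1.00), "D": (0.50, -0.50)}

# Cell frequencies keyed by the (A, B, C, D) levels, as read from Table 6.
FREQ = {
    (1, 1, 1, 1): 5, (1, 1, 1, 2): 4, (1, 1, 2, 1): 3, (1, 1, 2, 2): 2,
    (1, 2, 1, 1): 2, (1, 2, 1, 2): 3, (1, 2, 2, 1): 4, (1, 2, 2, 2): 5,
    (2, 1, 1, 1): 3, (2, 1, 1, 2): 2, (2, 1, 2, 1): 4, (2, 1, 2, 2): 5,
    (2, 2, 1, 1): 4, (2, 2, 1, 2): 3, (2, 2, 2, 1): 5, (2, 2, 2, 2): 2,
}

def box_muller(rng):
    # One standard normal deviate from two uniform deviates.
    u1 = 1.0 - rng.random()   # lies in (0, 1], so log(u1) is finite
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def simulate(effects, freq, seed=0, mu=0.0):
    """Return a list of ((a, b, c, d), y) observations."""
    rng = random.Random(seed)
    data = []
    for (a, b, c, d), n in freq.items():
        cell_mean = (mu + effects["A"][a - 1] + effects["B"][b - 1]
                     + effects["C"][c - 1] + effects["D"][d - 1])
        for _ in range(n):
            data.append(((a, b, c, d), cell_mean + box_muller(rng)))
    return data

data = simulate(EFFECTS_SET_I, FREQ, seed=0)
expected_difference = EFFECTS_SET_I["A"][0] - EFFECTS_SET_I["A"][1]
print(len(data), expected_difference)
```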



The cell frequencies (i.e., number of observations in each of the cell
combinations) are not equal, reflecting situations likely to be confronted in
actual studies. These frequencies, as presented in Table 6, were arbitrarily
chosen, with only the restriction that there be sufficient degrees of freedom
for testing any main effect.

Table 6

Cell Frequencies

        Cell
 A    B    C    D      Frequency

 1    1    1    1          5
 1    1    1    2          4
 1    1    2    1          3
 1    1    2    2          2
 1    2    1    1          2
 1    2    1    2          3
 1    2    2    1          4
 1    2    2    2          5
 2    1    1    1          3
 2    1    1    2          2
 2    1    2    1          4
 2    1    2    2          5
 2    2    1    1          4
 2    2    1    2          3
 2    2    2    1          5
 2    2    2    2          2

 Total                    56

A. Number of Categories

In the sweeping process, a critical question is: How many categories
should one use? It has been suggested that a relatively large number of
categories would be preferred (Gentleman, Gilbert & Tukey, 1969, p. 296).
However, results in the National Halothane Study and the National Assessment
of Educational Progress did not show a significant difference resulting from
the number of categories. This may be due to the fact that the cell values in
those studies were so homogeneous that different grouping processes would not
be sensitive enough to affect estimates for each category. Nevertheless,
differential effects resulting from various numbers of categories were
investigated with the following procedures.

First, factors B and C were smeared and swept into a new variable with
four categories. This new variable was then smeared over factor D and resulted
in a two-by-four table. The estimated cell values were ordered and are
presented in Table 7.

Table 7

Estimated Cell Value

                                      Cell Order
Data Set      1        2        3        4        5        6        7        8

I                    1.932     .505     .063    -.024    -.466   -1.894   -2.336
II          1.332     .974     .463     .105    -.066    -.424    -.936   -1.294
III          .482     .413     .084     .024     .014    -.045    -.374    -.444



The ordered eight cells were swept into categories, starting from the cell
with the highest value, in accordance with each of the following criteria:

1. Cells with positive values would be swept into one category, whereas
those with negative values would be swept into another.

2. The range of cell values within each category would be less than .45.

3. The range of cell values within each category would be less than .30.

4. The range of cell values within each category would be less than .05.

The numbers of resulting categories formed for each data set are presented
in Table 8.

Table 8

Number of Categories Formed Under Four Criteria

                       Criterion
Data Set       1       2       3       4

I              2       4       7       8
II             2       4       7       8
III            2       3*      3       7

*This classification was not used in the subsequent analyses.
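
The category counts for data sets II and III can be recomputed from the ordered cell values of Table 7. The sketch below assumes a greedy sweep that starts from the highest cell and opens a new category as soon as the range from the first cell of the current category reaches the stated limit; the text gives the criteria but not this exact procedure, and the function names are invented. Cell values are held as integer thousandths so that the comparisons are exact.

```python
def count_by_sign(values):
    # Criterion 1: one category for positive cells, one for negative cells.
    return len({v > 0 for v in values})

def count_by_range(values, limit):
    # Criteria 2-4: the range within a category must stay below `limit`.
    ordered = sorted(values, reverse=True)
    count, start = 0, None
    for v in ordered:
        if start is None or start - v >= limit:
            count += 1   # open a new category anchored at this cell
            start = v
    return count

def categories_formed(values):
    return [count_by_sign(values)] + [count_by_range(values, limit)
                                      for limit in (450, 300, 50)]

# Ordered cell values from Table 7, in thousandths.
set_ii = [1332, 974, 463, 105, -66, -424, -936, -1294]
set_iii = [482, 413, 84, 24, 14, -45, -374, -444]

counts_ii = categories_formed(set_ii)
counts_iii = categories_formed(set_iii)
print(counts_ii, counts_iii)
```

Under this rule the counts agree with the Table 8 rows for data sets II and III.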



The independent variable A was then cross-classified vith each final 
conglomerate variable to form a tvo-vay classification table. Analysis of 
variance was then conducted for this two-way classification table, and the 
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adjxisted marginal means for and Ag were computed with an additive model. 
The difference between the two levels, as contrasted with those expected true 
differences vsee Table 5) and thoss estimated by miUtif actor ANOVA, are presented 
in Table 9. Some smear-and-^sweep estimates (e.g., those obtained by two cate- 
gories) are as close to the expected differences as those obtained by multifactor 
AllOVA. The number of categories does affect estimates of the differences. For 
data set I, the more categories used, the smaller the difference between A^ and 
Ag, and the greater the deviation of the estimated difference from the expected 
value. This finding contradicts the suggestion that a larger number of cate- 
gories be used (Gentleman, Gilbert & Tukey, I969, p. 296). However, this 
finding of the number of c^ategories being negatively related to the magnitudes 
of the estimates is not necessarily supported by resialts from data set III, for 
which the seven-category estimate is closer to the expected than the three- 
category estimate. It is, therefore, not clear how systematically the choice 
of the number of categories can affect the precision of estimation. The authors 
suspect that the effects may fluctuate randomly. When the right number of 
categories is "hit," the estimates obtained by smear-and-sweep analysis can 
be as good as those by AJNOVA or other methods. 
B. Order of Variables 

It has been argued that the order of presentation of the variables is
analogous to step-wise regression analysis, in which the most important
variable should be introduced first (Gentleman, Gilbert, & Tukey, 1969,
p. 295). Previously, however, no systematic examination of this argument
has been conducted. It is, therefore, the purpose of this portion of the
study to explore the order effect of variables in the smear-and-sweep
process.

Table 9

Difference Between A1 and A2
As Obtained from Various Analyses

                               Data Set
Analysis                I         II        III

Smear and Sweep
  AxN(2)*             4.039     3.025     1.965
  AxN(3)                                  1.433
  AxN(4)              3.799     2.871
  AxN(7)              3.764     2.747     1.760
  AxN(8)              3.762     2.762
Factorial ANOVA       3.938     2.938     1.938
Expected              4.000     3.000     2.000

*The number in parentheses indicates the number of categories
for the final conglomerate variable.

For the same design and data used in the previous section, three possible
orders of variable presentation were investigated. They are:

(1) B, C, D (the same as C, B, D),

(2) D, B, C (the same as B, D, C), and

(3) C, D, B (the same as D, C, B).

The alphabetic order of B, C, and D indicates the order of importance of
these variables in terms of the magnitude of their effects (see Table 5).

Estimated differences between A1 and A2 from data set I under two
cell-pooling criteria are presented in Table 10. The results do not support
the argument that the most important variables should be introduced first.
Results from the other two sets of data also failed to provide positive
evidence. It seems that what makes the estimates differ is not the order of
variable presentation but the resulting number of final categories.

Table 10

Estimated Difference Between A1 and A2
With Three Orders of Variable Presentation for Data Set I

                        Cell-Pooling Criterion
Order of            Range less        Range less
Presentation        than .05          than [illegible]

B, C, D             3.762(8)*
B, D, C             3.762(8)          3.764(7)
C, D, B             3.760(6)          3.799(4)

*The figures in parentheses indicate the number of categories of
the final conglomerate variable.



C. Comparisons on Test Statistics 

Analysis of variance may be applied to the final cross-classification
table to test the significance of treatment effects or group differences
(Gentleman, Gilbert, & Tukey, 1969). The question then is: To what extent
will the results obtained by smear-and-sweep differ from those obtained by
a factorial analysis of variance if the data permit the latter analysis?
To answer this question, nonorthogonal analysis of variance (ANOVA) for a
factorial design was performed on the data used in previous sections to
obtain test statistics for A eliminating B, C, and D (A|B, C, D), namely,
an unconfounded test of A (see Appelbaum & Cramer, 1973). Nonorthogonal
ANOVA was also conducted on the final two-way cross-classification table
resulting from the smear-and-sweep process, with A as one dimension and the
newly formed variable as the other dimension. (It should be noted that the
order in which control variables were introduced into the smear-and-sweep
process was B, C, then D.)

The mean squares and degrees of freedom for each test are presented in
Table 11. It can be seen that in the smear-and-sweep analyses, between-group
variance decreases, as expected, as the number of categories of the final
conglomerate variable increases. The within-group variances, however,
fluctuate. When the variances are converted into F statistics, all of them
are significant at the .01 level with their associated degrees of freedom.
As far as significance testing is concerned, smear-and-sweep provides
results similar to factorial analysis of variance. However, smear-and-sweep
analysis may provide a more conservative test. Comparing A|B, C, D and
A|N(8), for example, both designs produce the same magnitude of error
variance and the same degrees of freedom for A effects, but their
between-group variances are quite different; A|N(8) has a much smaller
between-group variance than A|B, C, D. It is possible, when A effects are
small, that A|N(8) may provide test statistics indicating nonsignificant
A effects while A|B, C, D indicates significant differences.
Summary and Discussion 

Smear-and-sweep analysis is a method to compute summary statistics such
that the effects of interfering variables are reduced or controlled. The
basic strategy is to pool cells of similar values into categories. It
involves the following steps: (1) forming a two-way classification table
(i.e., smearing) and estimating cell values; (2) forming categories based on
cell values (i.e., sweeping); and (3) comparing the values among levels of
the independent variable of interest across the final set of categories.
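
The first two steps can be sketched in a few lines of Python. This is only
an illustration under stated assumptions, not the procedure used in the
study: the function names `smear` and `sweep`, the toy records, and the
pooling rule (cells are taken in order of value, and a new category is
started whenever a cell lies `max_range` or more above the first cell of
the current category) are all hypothetical.

```python
from collections import defaultdict

def smear(records):
    """Step 1: cell means of y over the two-way table of (b, c)."""
    totals = defaultdict(lambda: [0.0, 0])
    for b, c, y in records:
        totals[(b, c)][0] += y
        totals[(b, c)][1] += 1
    return {cell: s / n for cell, (s, n) in totals.items()}

def sweep(cell_means, max_range):
    """Step 2: pool cells, in order of value, into categories.

    Assumed rule: each category spans less than max_range in cell value.
    """
    categories = {}
    label, start = -1, None
    for cell, value in sorted(cell_means.items(), key=lambda kv: kv[1]):
        if start is None or value - start >= max_range:
            label += 1          # open a new category
            start = value       # remember its lowest cell value
        categories[cell] = label
    return categories

# Hypothetical toy records: (level of B, level of C, response y).
records = [
    (0, 0, 0.10), (0, 0, 0.12),
    (0, 1, 0.13),
    (1, 0, 0.50),
    (1, 1, 0.52), (1, 1, 0.54),
]
categories = sweep(smear(records), max_range=0.05)
```

The resulting category label plays the role of the conglomerate variable,
which would then be cross-classified with A for step (3).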

Since its development and application in the National Halothane Study,
smear-and-sweep has received little systematic evaluation. This study shows
that the precision of the summary statistics depends very much on the choice
of the number of categories; however, it seems that it is not always
preferable to have a large number of categories. The investigation does not
support the argument that the greater the number of categories, the better
the estimates. Furthermore, the choice of the categories has not yet been
systematically defined. Cluster analysis could be an alternative to
sequential two-way aggregation. Further investigation is warranted.

The suggestion of introducing the most important interfering variable
first in the process is also not supported. The order itself does not seem
to be a determinative factor in the precision of estimates. It is the number
of resulting categories that affects estimates. Once the number of
categories of the final conglomerate variable is selected, the order of
variable presentation does not seem to be critical. However, it should be
noted that the order of presentation may very likely determine the selection
of the number of categories.

The results of the investigation also show that smear-and-sweep tends to
provide a conservative significance test as compared to factorial analysis
of variance. When data are sparse, smear-and-sweep is an alternative method
that may lend some stability to estimates and allow exploration of the
treatment effect or of possible relationships between classification and
dependent variables.

Chapter VI: A Comparison of Balancing and Analysis of Covariance
in the Adjustment of Educational Data

Female and male admission rates for four graduate programs at the
University of North Carolina are compared for 1972-73 and 1973-74. To
assess possible sex-related bias in admission, rates are adjusted for
applicant qualifications by analysis of covariance and by balancing.

The adjusted admission rates reflect, in one case, i.e., for one
program and one admission year, a slight advantage for male applicants
over females, while in three cases female applicants were granted a
slight advantage over males in admission. In the remaining four cases,
there is no evidence that sex of applicant, per se, played a role in
admission decisions. Wherever a sex-related advantage is detected, the
favored sex is that with the fewer applicants to the program.

The dependent variable in this study, defined for each
applicant for a given program and enrollment year, is

\[
Y_{ij} =
\begin{cases}
1, & \text{if admitted} \\
0, & \text{if rejected,}
\end{cases}
\]

where $j = 1$ if the applicant is female and $j = 2$ if the applicant
is male, and $i = 1, 2, \ldots, n_j$, where $n_j$ is the number of female
(or male) applicants to the program for that year. Then

\[
P_j = \frac{1}{n_j} \sum_{i=1}^{n_j} Y_{ij}
\]

is the mean $Y_{ij}$ for sex $j$, and also represents the proportion
of applicants of sex $j$ who were admitted. The $P_j$ values are
the female and male admission rates presented in Table I.
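
Because each applicant's outcome is coded 0 or 1, the admission rate for a
sex is simply the mean of the indicators. A minimal sketch follows; the
function name `admission_rate` and the toy outcome lists are hypothetical
and are not data from the study.

```python
from statistics import mean

def admission_rate(outcomes):
    """Return P_j, the mean of the 0/1 indicators Y_ij for one sex."""
    return mean(outcomes)

# Hypothetical toy data: 1 = admitted, 0 = rejected.
female = [1, 0, 0, 1, 0]   # j = 1
male = [1, 1, 0, 0]        # j = 2

rates = {"female": admission_rate(female), "male": admission_rate(male)}
difference = rates["male"] - rates["female"]  # unadjusted M-F difference
```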

Given in Table I are the unadjusted rates of admission by
sex, and in Table II the within-group correlations of interfering
variables with admission.
TABLE I

Graduate Admission Rates by Sex for 1972 and 1973*

                            Female      Number of     Male        Number of
                            Admission   Female        Admission   Male
Field              Year     Rate        Applicants    Rate        Applicants

English            1972     32.1        165           34.0        235
                   1973     20.3        153           25.0        204
History            1972     68.2         44           55.5        182
                   1973     54.5         55           48.0        177
Library Science    1972     59.4        175           70.2         57
                   1973     51.0        157           50.0         40
Sociology          1972     31.6         38           18.2         66
                   1973     22.6         31           11.6         69

*Excluded from the table are all applicants for whom less than
complete data were available from the set of undergraduate
grade point average, GRE scores, and two letters of recommendation.



TABLE II

Point-Biserial Correlations of Qualification Variables
with Admission for Female (F) and Male (M) Applicants

                          GPA         GRE V        GRE Q       GRE Adv        REC
Field            Year    F    M      F    M       F    M       F    M       F    M

English           72    .29  .42    .16  .46     .18  .34     .20  .33     .26  .22
                  73    .37  .27    .28  .29     .15  .27     .25  .35     .17  .19
History           72    .52  .52    .56  .43     .24  .43     .39  .33     .42  .49
                  73    .38  .36    .62  .54     .43  .47     .14          .42  .35
Library Science   72    .35  .42    .50 -.06     .30 -.07                  .39  .29
                  73    .43  .38    .38  .55     .48  .65                  .33  .47
Sociology         72    .27  .46    .51  .14     .67  .15     .60  .16     .07  .32
                  73    .48  .12    .16  .34     .06  .35     .37  .26     .39  .15
Also of interest are the mean differences between the sexes
on the interfering variables, given in Table III. These are
computed by subtracting the mean for female applicants from the
mean for male applicants, so that a positive mean difference
represents a male advantage and a negative mean difference
represents a female advantage. The unit, in each case, is that
in which each variate is naturally recorded.

Somewhat more informative are the standardized mean differences,
presented in Table IV. Here, each male-female mean difference
is divided by the standard error of the mean difference.
Each value in Table IV represents a t statistic. Those values
which differ from zero by approximately two or more are judged
to differ from zero sufficiently to represent a statistically
significant difference between sexes. The results of Table IV
represent the values of

\[
t_m = \frac{\bar{X}_{m2} - \bar{X}_{m1}}
           {s_m \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} .
\]

The index $m$ identifies the covariate, as defined in Table V;
$\bar{X}_{mj}$ is the mean of covariate $m$ for applicants of sex $j$;
$s_m$ is the within-sex standard deviation for that covariate; and
$n_j$ indicates the number of applicants of sex $j$.
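
The standardized difference can be sketched as follows. The sketch assumes
that the within-sex standard deviation is the usual pooled estimate with
n1 + n2 - 2 degrees of freedom, which the text does not state explicitly;
the function name and the toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean

def standardized_difference(female, male):
    """Male-minus-female mean difference divided by its standard error.

    Assumes a pooled within-sex standard deviation (an assumption of
    this sketch, not a detail confirmed by the report).
    """
    n1, n2 = len(female), len(male)
    m1, m2 = mean(female), mean(male)
    # Pooled within-sex sum of squares about each sex's own mean.
    ss = sum((x - m1) ** 2 for x in female) + sum((x - m2) ** 2 for x in male)
    s = sqrt(ss / (n1 + n2 - 2))
    return (m2 - m1) / (s * sqrt(1 / n1 + 1 / n2))

# Hypothetical toy covariate scores (e.g., grade point averages).
female_scores = [3.0, 3.5, 4.0]
male_scores = [2.0, 2.5, 3.0]
t = standardized_difference(female_scores, male_scores)
```

With the sign convention of the text, a negative value indicates a female
advantage on the covariate.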

Inspection of Tables III and IV is instructive. Without
exception, the mean grade point average for women applicants
is higher than that for males for each program and each year.
On GRE scores, women applicants show higher mean scores than
male applicants on the verbal test (except for applicants to
the Department of English), while males show higher means than
females on the quantitative test, and also (with the exception
of Sociology applicants in 1973) display higher mean scores on
the advanced test. For each program and each year, the mean
summary score derived from letters of recommendation is higher
for males than for females. The mean difference on GRE-Q between
male and female applicants to English is extraordinarily large,
more than 70 points in both years (Table III), with highly
significant t statistics, 7.1 and 6.6 (Table IV).

A comment is in order concerning the consistent advantage
of male applicants on mean level of recommendation for graduate
study (Tables III and IV), especially since it contrasts with a
female advantage on grade point average and (usually) on GRE-V.
TABLE V

Variables Pertaining to Admission Qualifications

Variable    The Nature of the Variable

1           Undergraduate grade point average for final two years (GPA)
2           Verbal score on the Graduate Record Examination (GRE-V)
3           Quantitative score on the Graduate Record Examination (GRE-Q)
4           Score on the Advanced Test, Graduate Record Examination (GRE-A)
5           Mean recommendation (with each coded 0-4)

It is possible that this represents a bias toward males on the
part of those who recommend applicants, who most frequently
are male faculty members. However, the recommendation is
couched in terms of the probability that the candidate will
successfully complete a doctoral program; the apparent male
advantage could be the result of possibly valid judgments that
women have been more likely than men to discontinue graduate
work before receiving the Ph.D.



TABLE VI

Comparison of Unadjusted and Adjusted Female (F)
and Male-Female (M-F) Admission Rates

                                                        Adjusted
                             Unadjusted          ANCOVA           Balancing
Field              Year      F       M-F        F       M-F      F       M-F

English            1972     32.1      1.9      33.5     -.4     31.9      2.3
                   1973     20.3      4.7      23.0     -.1     22.6       .7
History            1972     68.2    -12.7      67.9   -12.3     68.3    -12.8
                   1973     54.5     -6.5      48.9      .9     46.3      4.3
Library Science    1972     59.4     10.8      59.6    10.1     59.0     12.4
                   1973     51.0     -1.0      50.6      .8     50.0      3.8
Sociology          1972     31.6    -13.4      29.2    -9.7     30.4    -11.5
                   1973     22.6    -11.0      21.9   -10.0     22.3    -10.6


A reasonable indication of apparent favoritism toward
males in admission to graduate study is provided by the male-
female difference in admission rate. For each program and
each year, this difference is compared in Table VI with the
male-female difference after adjustment by analysis of covariance
and after adjustment by balancing for the sex differences on
all five interfering variables. The adjusted differences in
admission rate using analysis of covariance are plotted against
the unadjusted differences in Figure 1.
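
The logic of the covariance adjustment can be illustrated with a single
covariate. The sketch below is a simplification under stated assumptions:
the study adjusts for five covariates, whereas this example uses one and
the classical pooled within-group slope; the function name and the toy
data are hypothetical.

```python
from statistics import mean

def ancova_adjusted_difference(y_f, x_f, y_m, x_m):
    """M-F difference in mean y after adjusting for one covariate x.

    Adjusted difference = (unadjusted difference in y)
                          - (pooled within-sex slope) * (difference in mean x).
    """
    def sums(y, x):
        my, mx = mean(y), mean(x)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        return my, mx, sxy, sxx

    my_f, mx_f, sxy_f, sxx_f = sums(y_f, x_f)
    my_m, mx_m, sxy_m, sxx_m = sums(y_m, x_m)
    b_within = (sxy_f + sxy_m) / (sxx_f + sxx_m)  # pooled within-sex slope
    return (my_m - my_f) - b_within * (mx_m - mx_f)

# Hypothetical toy data: y is the 0/1 admission indicator, x a qualification
# score. Both sexes are admitted at the same rate, but the male applicants
# score one point higher on average.
y_f, x_f = [0, 0, 1, 1], [1, 2, 3, 4]
y_m, x_m = [0, 0, 1, 1], [2, 3, 4, 5]
adjusted = ancova_adjusted_difference(y_f, x_f, y_m, x_m)
```

In this toy case the unadjusted difference is zero, while the adjusted
difference is negative, i.e., it indicates a female advantage once the
difference in qualifications is taken into account.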

From Table VI or Figure 1, several conclusions follow.

First, neither the covariance adjustment nor the balancing
adjustment radically changes the impressions gained from
assessing unadjusted differences in admission rates for men and
women. In the most extreme cases, History in 1972 and Sociology
in both years appeared to favor female applicants, and the
appearance persists in the admission rates following adjustment;
Library Science in 1972 appeared to favor male applicants
and, again, the adjustment leads to no different appearance.

A somewhat different conclusion does arise, however,
regarding the influence of sex of applicant upon admission policy
in History for 1973 and in English for 1973. The adjusted
results for History, 1973, suggest that the apparent favoritism
of female applicants may have been a consequence of male-female
differences in scores on the covariates. While the unadjusted
rates favored women by 6.5 percent, adjusted rates actually
favor men by .9 percent from ANCOVA, and by 4.3 percent from
balancing. For English, 1973, unadjusted rates suggested a
tendency to favor males slightly over females in admission.
After adjustment, there are only negligible differences between
male and female rates.




FIG. 1. Male-female Differences in Admission Rates: Adjusted
Differences (by Analysis of Covariance) vs. Unadjusted Differences.
[Figure not reproduced; the horizontal axis, Unadjusted Difference,
runs from -10 to 10.]

The fourth and sixth columns of Table VI suggest that,
for one program in one year, Library Science in 1972, male
applicants were accepted more frequently than female applicants
for reasons other than differences in grade point average,
GRE scores, or strength of recommendations. For History in
1972 and for Sociology in both years, female applicants appear
to have been granted a similar advantage. In the remaining
four cases there is no evidence that sex of applicant, per se,
played a role in admission decisions.

DISCUSSION 

Adjustment techniques as used in this study provide an
answer to the question of what female and male admission rates
to graduate study might have been if the two sexes had
presented equal qualifications on a set of interfering
characteristics, as these characteristics were used by the admission
committee to select applicants. Interpretation of adjusted
scores must be tempered by the realization that these
adjustments occur under the admission committee's definition of
qualification.

Adjusted rates provide some clue as to what male and female
acceptance rates might have been if males and females had
had the same distributions for the interfering variables. It
is most meaningful to examine adjusted rates in conjunction
with unadjusted rates, which represent how often females and
males in reality were accepted. Adjusted acceptance rates
provide more information concerning the fairness of the admissions
committee, but when examined in conjunction with unadjusted
rates they also provide information concerning the differential
qualifications of males and females. A large difference
between unadjusted rates and rates adjusted for a particular
characteristic suggests that the average score is quite
different for male and female applicants, and also that the committee
considers this difference to be important. Such a pattern of
scores may provoke interest as to why the applicants of one
sex are more qualified than those of the other on the average
and also as to why the committee considers this characteristic
to be important in defining qualification. For example, female
applicants to these departments seem to have higher
GPA and GRE-V averages but lower GRE-Q, GRE-ADV, and REC
averages than male applicants in the corresponding departments.
It would be interesting to investigate how uniform this pattern
is among applicants to other departments at this university
and among applicants to other graduate schools. Even if
adjusted male and female rates are approximately the same,
further investigation may be desirable if unadjusted rates are
very different from each other, since this suggests that the
applicants of one sex are much more qualified than those of
the other. For example, a validity study might be conducted
to insure that interfering characteristics are being used
fairly, or a study might be done to determine why highly
qualified persons of one sex but not of the other are motivated to
apply to the department.

In this investigation of admission of applicants to four
graduate programs, only modest differences were observed
between admission rates for females and males, sometimes favoring
one sex and sometimes favoring the other. After adjusting for
sex differences in undergraduate grade point average, Graduate
Record Examination scores, and recommendations, some of these
differences remained (three favoring females, one favoring
males), while others disappeared. The study illustrates the
appropriateness of adjusting admission rates before drawing
conclusions concerning sex differences in admission to graduate
study.

Conclusions

A recurring problem in educational research (and indeed in social,
behavioral, and medical research) has been the adjustment of data to
account for initial differences among observed groups of individuals on
attributes uncontrollable by the researcher. Unlike the experimental
solution introduced by Fisher, randomization, which typically cannot be
employed in the educational setting, the majority of "solutions" employed
by educational evaluators have been essentially statistical or data
analytic adjustments. While the use of this class of techniques is by no
means new, little in the way of systematic investigation of their nature or
relation to other statistical techniques emerging from the Fisherian
tradition has been undertaken.

In such nationally important research undertakings as the NAEP studies
of educational progress, it was appropriate to employ such techniques, still
without a detailed understanding of their nature. Chief among these
techniques was that known as balancing, defined for situations in which the
basic data are proportions of successes in the cells of a multiply
classified table, usually with unequal numbers of basic observations in the
several cells. It was, at the outset, known that simple comparisons of raw
proportions would lead to confounded results, and hence it was appropriate
to employ a technique which could potentially untangle the various
influences which exhibited themselves in the data.

Balancing is not, however, the only technique that has been proposed to
accomplish this end. Techniques such as direct and indirect standardization,
"smear-and-sweep," and the analysis of covariance have all been employed at
various times for similar purposes.

It was the aim of the research herein reported to develop a better
understanding of the nature and similarities of these techniques, thereby to
make possible a greater appreciation of their implications for applied
research. As it has turned out, there is, for many of these techniques, a
single unifying approach, that of the nonorthogonal analysis of variance. By
viewing them in terms of this type of analysis, unexpected insights into
their nature were found. In order to accomplish this, however, more needed
to be understood about the nonorthogonal analysis of variance, and hence a
substantial portion of the activities of this investigation was spent on a
detailed study of this technique.

It was found that the nature of the nonorthogonal analysis of variance
can be understood by viewing it as a comparison of competing models, with
the role of significance testing being simply the means for selecting the
best of the competing models. It was found that both ignoring and
eliminating tests were jointly necessary to accomplish these ends and that
it is not always possible to select a single best model (i.e., there is the
possibility, albeit rare, of an ambiguous result). Of importance for the
later insights into adjustment techniques were the results on estimation
which follow the selection of the "best" model, particularly those results
which bear upon marginal means and the concept of weighting.

With the results from the study of the nonorthogonal analysis of variance
firmly established, we looked more closely at the several adjustment
techniques. As had been speculated, it was possible to show that if one
defines success or failure as a binary random variable, the equations of the
balancing method are identical to those that define the nonorthogonal
analysis of variance in a main effects model. The result of this equivalence
is that one can, with some care, use standard ANOVA programs to perform
balancing; consequently, the large body of literature concerning analysis of
variance can be directly applied to the balancing situation. Of greater
importance for interpretation, however, is the virtual identity between the
balanced estimates and the estimated marginal means in nonorthogonal ANOVA.
This identity led us to explore the various types of weighting schemes for
marginal means and to conclude that, in a more general, and possibly
interactive, context, one needs first to adopt a linear model which
accurately reflects the population from which the data were obtained.
Following the selection of the appropriate model (a significance testing
problem) and the proper estimation of parameters in that model (an
estimation problem independent of the significance testing problem), the
weights are then chosen as a function of the use to which the marginal means
are to be put. Balancing, which inherently implies a main effects model, has
been used to compare groups as if the groups were comparable on other
variables. This necessarily implies estimation in a main effects model
followed by weighting with singly subscripted weights. If, however, one were
to decide that an interactive model was more appropriate (by use of the
nonorthogonal ANOVA, for instance), one could estimate cell means in that
model and then again use singly subscripted weights to draw the same type of
conclusions but under a rather different model of nature.
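
The equivalence described above can be illustrated with a small sketch that
fits a main effects model to binary outcomes in a two-way table with unequal
cell counts and then weights the fitted values with singly subscripted
weights. Everything here is an assumption made for illustration: the
function name, the toy table, the choice of column proportions as weights,
and the alternating least squares updates used to solve the normal
equations are not taken from the report or from the NAEP programs.

```python
def balanced_rates(successes, counts, iterations=200):
    """Least squares fit of p_ij ~ a_i + b_j, then weighted row estimates.

    successes[i][j] and counts[i][j] describe cell (i, j) of a two-way
    table. The alternating updates below solve the normal equations of
    the main effects model (an illustrative choice of algorithm).
    """
    rows, cols = len(counts), len(counts[0])
    a = [0.0] * rows
    b = [0.0] * cols
    for _ in range(iterations):
        for i in range(rows):
            n_i = sum(counts[i])
            a[i] = sum(successes[i][j] - counts[i][j] * b[j]
                       for j in range(cols)) / n_i
        for j in range(cols):
            n_j = sum(counts[i][j] for i in range(rows))
            b[j] = sum(successes[i][j] - counts[i][j] * a[i]
                       for i in range(rows)) / n_j
    # Singly subscripted weights: proportion of all cases in each column.
    total = sum(sum(row) for row in counts)
    w = [sum(counts[i][j] for i in range(rows)) / total for j in range(cols)]
    shift = sum(w[j] * b[j] for j in range(cols))
    return [a[i] + shift for i in range(rows)]

# Hypothetical table with exactly additive cell proportions:
# row 1: 0.2, 0.4; row 2: 0.4, 0.6; cell counts are unequal.
counts = [[10, 30], [30, 10]]
successes = [[2, 12], [12, 6]]
raw = [sum(s) / sum(n) for s, n in zip(successes, counts)]
balanced = balanced_rates(successes, counts)
```

In this toy table the raw row rates differ by 0.10, while the balanced
estimates differ by 0.20, the difference that holds within every column.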

Direct standardization can be viewed in a similar way. Since direct
standardization is based on observed cell means (estimates from an
interactive model), the results of direct standardization must differ from
those of balancing (a main effects model) when interactions are present.
Standardized estimates can also be obtained by estimation in an interactive
model combined with the use of singly subscripted weights based upon the
proportion of cases in the "standard" population. Indirect standardization,
however, does not fit this type of model and may give different results
from balancing, even when no interaction is present.
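
A minimal sketch of direct standardization follows: observed cell
proportions are combined with singly subscripted weights taken from a
"standard" population. The function name, the toy table, and the equal
weights are hypothetical choices made only for illustration.

```python
def directly_standardized_rates(successes, counts, weights):
    """Weight each row's observed cell proportions by the standard weights.

    successes[i][j] / counts[i][j] is the observed proportion in cell
    (i, j); weights[j] is the proportion of the standard population in
    column j.
    """
    return [
        sum(w * s / n for w, s, n in zip(weights, s_row, n_row))
        for s_row, n_row in zip(successes, counts)
    ]

# Hypothetical table with observed cell proportions
# row 1: 0.2, 0.4; row 2: 0.4, 0.6.
counts = [[10, 30], [30, 10]]
successes = [[2, 12], [12, 6]]
standard = [0.5, 0.5]  # assumed standard population proportions
rates = directly_standardized_rates(successes, counts, standard)
```

Because the cell proportions in this toy table are exactly additive (no
interaction), a main effects fit would reproduce them, and the standardized
difference between rows equals the common within-column difference.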

Adjustment by analysis of covariance is similar to both balancing and
direct standardization, although on its face it appears to provide a
different type of adjustment. If one considers a multifactor main effects
ANOVA design, where for a particular factor one includes only the linear
component in the model, the estimated cell means are identical to what would
be obtained had a covariate been used in place of the factor. In this
special and somewhat limited case the balanced estimates will be identical
to estimates adjusted for a covariate. One could as well generalize this
result to interactive models, so that we have a class of adjustment
procedures which are essentially equivalent, differing primarily in the
choice of the appropriate linear model.

The choice between balancing, direct standardization, and analysis of
covariance is necessarily dependent only upon which provides the most
appropriate linear model. In fact, none of them may provide a parsimonious
model, and we think it preferable to think of choosing the correct model in
the more general context of nonorthogonal ANOVA, with these special cases
providing frequently chosen options. Indirect standardization would seem to
be a less preferable choice.

The smear-and-sweep procedure differs markedly from the above procedures in
that it is comparatively ill-defined and arbitrary; there is no
well-justified rule for deciding the order in which classification variables
are to be selected and how cells should be pooled. Our investigation
suggests that the number of categories may substantially affect the
estimated effects, while the order of variables has a considerably smaller
effect. In view of the arbitrariness involved, we can see little
justification for the use of the smear-and-sweep procedure to meet the
purposes that also may be served by balancing or analysis

In summary, a number of the adjustment techniques employed for the purpose
of adjusting for initial differences among observed groups are closely
related through the more general nonorthogonal analysis of variance. In
general these techniques are actually a combination of three rather distinct
processes: the determination of an appropriate linear model, the estimation
of parameters, and the combining of estimates by a weighting scheme. Each
technique (save smear-and-sweep) employs a particular combination of these,
usually prescribed before the fact. A detailed understanding of how each
operates relative to these processes then allows for a better understanding
of its basic nature.